Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

A tutorial on generalizing the default Bayesian t-test via posterior sampling and encompassing priors

A tutorial on generalizing the default Bayesian t-test via posterior sampling and encompassing... With the advent of so-called “default” Bayesian hypothesis tests, scientists in applied fields have gained ac- cess to a powerful and principled method for testing hypotheses. However, such default tests usually come with a compromise, requiring the analyst to accept a one-size-fits-all approach to hypothesis testing. Further, such tests may not have the flexibility to test problems the scientist really cares about. In this tutorial, I demonstrate a flexible approach to generalizing one specific default test (the JZS t-test; Rouder et al., 2009) that is becoming increasingly popular in the social and behavioral sciences. The approach uses two results, the Savage-Dickey density ratio (Dickey & Lientz, 1980) and the technique of encompassing priors (Klugkist et al., 2005) in com- bination with MCMC sampling via an easy-to-use probabilistic modeling package for R called Greta. Through a comprehensive mathematical description of the techniques as well as illustrative examples, the reader is presented with a general, flexible workflow that can be extended to solve problems relevant to his or her own work. Note: this paper is in press at Communications for Statistical Applications and Methods. Keywords: Bayes factors, Bayesian inference, hypothesis testing, MCMC sampling, JZS t-test, Savage-Dickey density ratio, encompassing priors 1. Introduction The t-test is one of the simplest, yet most enduring, examples of a hypothesis test that the social and behavioral scientist uses in his or her daily work. In the typical framework of null hypothesis significance testing (NHST), the t-test works by first assuming a null hypothesis, and then calcu- lating a t-score, which indexes the likelihood of obtaining some sample of observed data under the null hypothesis. If this probability is small, the scientist rejects the null in favor of some alternative hypothesis. Consider the following scenario that is often used when assessing the e ect of some treatment. Let th x and x denote measurements for the i participant in two di erent conditions (e.g., a pretest and i1 i2 posttest). Consider the di erence d = x x . A typical consideration is whether these di erences i i2 i1 are di erent from 0; answering this question in the armative would then imply that the treatment had some nonzero e ect. To answer this question, one can apply the standard one-sample t-test, which works by first assuming d  Normal(;  ); and then defining two competing hypotheses: a null hypothesis H :  = 0 and an alternative hypoth- esis H :  , 0. We then test H by computing 1 0 t = ; s= n Corresponding author: Assistant Professor, Department of Psychological Sciences, Tarleton State University, Stephenville, Texas, USA. E-mail: faulkenberry@tarleton.edu arXiv:1812.03092v2 [stat.CO] 5 Feb 2019 2 where d is the mean of the di erences d across all participants i = 1; : : : ; n, s is the sample standard deviation of the di erence scores d , and n is the sample size. Under the null H , the distribution of t i 0 is well known as Student’s t distribution, with density function 2 2 f (x) =   1 + ; where  represents degrees of freedom, and x 2 (1;1). The cumulative distribution function F(x) = f (u)du can then be used to index the probability of observing data at least as extreme as that which we observed under the null hypothesis H . Specifically, we compute p(jxj > t) = 1 F(t), a quantity commonly known as a p-value. If this probability is small (say, less than 5%), then one may decide to reject H and conclude that  , 0, thus implying that our treatment had some nonzero e ect. This idea is well known to practicing social and behavioral scientists. However, there are some properties of this procedure that are suboptimal for robust inference. For one, the procedure is asym- metric (Rouder et al., 2009). Suppose that one calculates a p-value above some commonly-used threshold like 5%. What decision does the researcher make? Surely the logical opposite of “reject H ” is “accept H ”. However, this decision rule is inconsistent. The reason for this follows from 0 0 examining the distribution of p-values that result from increasing sample sizes. When the null is false (i.e.,  ,  ), the value of t increases as sample sizes increase. Thus, the probability of rejecting H 1 2 0 increases accordingly. However, if the null is true, p-values are uniformly distributed between 0 and 1, regardless of sample size. So, whereas a false null hypothesis can always be rejected if sample size is large enough, a true null hypothesis is always susceptible to being incorrectly rejected. Such incon- sistency leads to asymmetry in the testing procedure – increasing sample size can increase evidence against a false null hypothesis, but there is no corresponding way to increase evidence for a true null hypothesis. Another criticism of the traditional hypothesis testing procedure is that researchers often misin- terpret the results of such tests. Hoekstra et al. (2014) asked 562 researchers and students from the field of psychology to assess the validity of six di erent statements involving incorrect interpretations of confidence intervals (e.g., “The probability that the true mean is greater than 0 is at least 95%”). Although each of these statements was false, both students and researchers on average believed at least 3 of the statements were true. Furthermore, researchers did no better than students with respect to these misunderstandings. This finding echos results by Oakes (1986), who performed a similar study using statements about p-values; see also Gigerenzer (2004). In light of these criticisms, the social and behavioral sciences have seen an increase in recommen- dations to find alternatives to the orthodox use of null hypothesis significance testing (Wagenmakers et al., 2011). One such alternative is to use Bayesian inference, and in particular Bayes factors (Kass and Raftery, 1995; Raftery, 1995; Masson, 2011). Bayesian inference is based on calculating the posterior probability of a hypothesis H after observing data y. This calculation proceeds by Bayes’ theorem, which states p(y j H ) p(H ) p(H j y) = : (1.1) p(y) One way to think of Equation 1.1 is as follows: prior to observing data, one assigns a prior belief p(H ) to a hypothesis H . Once the data y have been observed, one updates this prior belief to a posterior belief p(H j y) by multiplying the prior p(H ) by the likelihood p(y j H ). This product is then rescaled to meet the requirements for being probability distribution (i.e., total probability = 1) by dividing by p(y), the marginal probability of the observed data averaged across all possible hypotheses H . 3 While this computation is fundamentally quite basic, one immediate consequence is how it can be used for comparing two hypotheses. Suppose as above that we have two competing hypotheses: a null hypothesisH and an alternative hypothesisH . We can directly compare our posterior beliefs in 0 1 these two hypotheses by computing their ratio p(H j y)= p(H j y), which we call the posterior odds 0 1 for H over H . Using Bayes’ theorem (Equation 1.1), we can readily see 0 1 p(H j y) p(y j H ) p(H ) 0 0 0 =  : (1.2) p(H j y) p(y j H ) p(H ) 1 1 1 | {z } | {z } | {z } posterior odds Bayes factor prior odds As with Bayes’ theorem, Equation 1.2 can also be interpreted in terms of an “updating” metaphor. Specifically, the posterior odds ratio is equal to the prior odds ratio multiplied by an updating factor. This updating factor is the ratio of the marginal likelihoods p(y j H ) and p(y j H ), and is called 0 1 the Bayes factor (Je reys, 1961; Kass and Raftery, 1995). The Bayes factor is the weight of evidence provided by data y. For example, suppose that one assigned the prior odds of H to H to be equal to 0 1 4-to-1; that is, we believe that, a priori, H is 4 times more likely to be true than H . Then, suppose 0 1 that after observing data, we compute a Bayes factor was computed to be 5. Now, the posterior odds (the odds of H over H after observing data) is 20-to-1 in favor of H over H . 0 1 0 1 There are two immediate advantages to using the Bayes factor for inference. First, the Bayes factor is a ratio, and thus, is subject to a natural interpretation. Simply put, larger is better - the bigger the Bayes factor, the bigger the weight of evidence provided by the observed data. Second, since there was no specific assumption about the order in which we addressed H and H , we could have just as 0 1 easily measured the weight of evidence in favor of H over H . In fact, once we have a Bayes factor 1 0 in favor of one hypothesis, a simple reciprocal will give us the Bayes factor in favor of the the other hypothesis. In our example above, the Bayes factor for H over H would have been 1/5, implying 1 0 that the data would actually decrease our relative belief in H over H . Because we can compute 1 0 Bayes factors from either direction, we must be careful to define our notation carefully. In this paper, I will adopt the common convention to define B as the Bayes factor for H over H . Similarly, B 01 0 1 10 would represent the Bayes factor for H over H . Note that, by our discussion above, B = 1=B . 1 0 01 10 Though the previous discussion certainly speaks positively about the benefits of using the Bayes factor as a tool for inference, there are some important considerations that the researcher must address before implementing it as a tool for inference. First, as we’ll see in the discussion below, the Bayes factor requires the analyst to specify prior distributions on all parameters in the underlying model. Thus, a given Bayes factor reflects a specific choice of prior. Second, the computation of these Bayes factors is usually nontrivial. For example, computing the either the numerator or denominator of Equation 1.2 requires explicitly defining the hypothesisH as a model consisting of vectors  in some parameter space  and integrating the likelihood weighted by a prior distribution on these parameters; that is, p(y j H ) = f (y j ;H ) p( j H )d; i i i where f is the likelihood function, and p denotes the prior distribution on parameters  2  under model H . However, the last decade has seen the development of many tools intended to simplify the calculation of Bayes factors for the common models used by applied researchers, including online calculators, standalone software packages such as JASP (JASP Team, 2018), and a wide range of packages for the statistical computing environment R (R Core Team, 2018), including the package BayesFactor (Morey and Rouder, 2018). Because these solutions are designed to work across a variety 4 of contexts, one must necessarily assume some defaults with respect to the models that underly these calculators. Many have argued that these defaults represent reasonable assumptions about the types of problems with which many applied researchers are concerned. However, recent advances in statistical computing have made it easier for the practicing researcher to build his or her own custom models for a given situation and compute Bayes factors to compare these models. In this paper, I will provide a tutorial with particular focus on extending one type of Bayesian model comparison known as the Je reys-Zellner-Siow t-test (Rouder et al., 2009) (henceforth abbre- viated as JZS t-test). Specifically, I will describe a generalization that provides an adaptable, compu- tationally ecient method for computing Bayes factors in a variety of single-sample and independent- samples designs. The organization of the paper is as follows. First, I will describe the mathematical underpinnings of the JZS t-test. Then, I will present two results which allow us to generalize the JZS t-test to a broader class of model comparisons: (1) the Savage-Dickey density ratio (Dickey and Lientz, 1970; Wagenmakers et al., 2010; Wetzels et al., 2009), which is used for comparing models in which one is a sharp hypothesis (e.g, a point null hypothesis) nested within another unconstrained model; and (2) the encompassing prior technique (Klugkist et al., 2005), which is used for comparing nested models with ordinal constraints. Finally, I will demonstrate (with examples) how to use these techniques along with posterior sampling (Gelfand and Smith, 1990) to compute Bayes factors for model comparisons involving both point-null hypotheses (i.e., testing whether an e ect is exactly 0), directional hypotheses (i.e., testing whether an e ect is postive compared to whether it is negative), and interval-null hypotheses (i.e., testing whether an e ect is approximately 0). 2. The JZS t-test The JZS t-test (Rouder et al., 2009) was developed as a default Bayesian version of the orthodox t-test described above. We will denote by y a vector of observed data, and assume as above that y is normally distributed with mean  and variance  . We can then explicitly define two competing hypotheses: a null hypothesis H and an alternative hypothesis H . Both hypotheses can be parameterized with 0 1 two parameters,  and  . Under the alternative hypothesisH , we allow  and  to freely vary. That is, H is an unconstrained model;  could be positive, negative, or zero. For the null hypothesis H , 1 0 which states that the mean of data y is equal to 0, we can simply constrain  to be 0. Thus, we say that H is nested within H . Under these models, we can compute the Bayes factor B = p(y j H )= p(y j 0 1 01 0 H ), where 2 2 2 p(y j H ) = f (y j  = 0;  ;H ) p( ;H )d 0 0 0 and Z Z 1 1 2 2 2 p(y j H ) = f (y j ;  ;H ) p(;  ;H )d d: 1 1 1 1 0 2 2 These computations require placing priors on  under the null model H and both  and  under the alternative model H . Following Je reys (1961) and Zellner and Siow (1980), Rouder et al. reparameterized the problem by placing a Cauchy prior on e ect size  = =. That is, under H , Cauchy(0; r), where r represents the scale of expected e ect sizes, and underH ,  = 0. Rouder et 2 2 2 al. placed a Je rey’s prior on  ; specifically, p( ) / 1= . With these default prior specifications, Rouder et al. (2009) showed that the Bayes factor can be computed as t 2 1 + B = ; (2.1) 1 1 2 1 3 t 1 2 2 2 (1 + Ngr ) 1 + (2) g exp dg 2g 0 (1+Ngr ) 5 where t is the orthodox t statistic, r is the scale on the e ect size prior, N is the number of observations in y, and  = N 1 denotes the degrees of freedom. Though computationally convenient, this JZS Bayes factor formula has a few disadvantages. First, it reflects a very specific choice of prior specification. Though using a Cauchy prior on e ect size may be a reasonable choice for many researchers, especially in the behavioral sciences (Rouder et al., 2009), others may argue for a di erent prior. For example, Killeen (2007) used meta-analytic data from Richard et al. (2003) to argued that e ect sizes in social psychology are typically normally distributed with variance equal to 0.3. Certainly, one advantage of a Bayesian approach is that the prior on e ect size should reflect the analyst’s prior belief on what e ect sizes should be expected. For some fields of study, these e ect sizes may be expected to be small (i.e., social psychology), whereas in other fields, these e ect sizes may be expected to be larger. Also, note that the reason for using the Cauchy prior (instead of a normal prior) in the JZS Bayes factor is one of computational convenience; it simply makes the computation work out. If the analyst wants to use a di erent prior, the formula for the JZS Bayes factor in Equation 2.1 would have to be recomputed, a task which would only be accessible to those researchers with the appropriate mathematical background. Another disadvantage of the JZS Bayes factor is that it forces the analyst into a very specific hypothesis test; that is, H :  = 0 versus H :  , 0. One may be interested instead in more flexible 0 1 testing situations – for example, testing a directional hypothesis H :  > 0 versus H :   0, or 0 1 an interval hypothesis such as H : " <  < " versus an alternative model H : jj > " (Morey 0 1 and Rouder, 2011). These tests would each require a major readjustment to the derivation of the JZS Bayes factor. Instead, I propose that we approach problems like these using a fundamentally di erent set of tools, which I will now describe. 3. Generalizing the JZS t-test In this section, I describe a method for extending the default JZS t-test of Rouder and colleagues to use a wider class of priors on e ect size. This generalization thus allows the analyst more freedom to specify a prior that may better reflect his or her a priori expectation about what e ect sizes are typically encountered in a given field. The original description of this method is due to Wetzels et al. (2009) and Wagenmakers et al. (2010), though the software implementation and specific extensions I will describe are novel. The core method relies on a result known as the Savage-Dickey density ratio (Dickey and Lientz, 1970), which states that the Bayes factor for a pair of models in which one of the models is a one- point restriction of the other (i.e., H :  = 0 versus H :  , 0) is simply the ratio of the ordinates 0 1 of the one point of interest (i.e.,  = 0) in the posterior and prior densities, respectively. The technical formulation and proof will be presented momentarily. For now, however, one should appreciate that this can be a great simplification over other methods of computing Bayes factors presented above, since there is no need for integration. Assuming one can estimate the prior and posterior densities via some sampling method (e.g., Markov chain Monte Carlo, or MCMC, sampling), then the computation of this ratio of densities is straightforward. The Savage-Dickey density ratio can stated rigorously as the following proposition: Proposition 1. (Savage-Dickey Density Ratio) Consider two competing models on data y containing parameters  and ', namelyH :  =  ; ' andH : ; '. In this context, we say that  is a parameter 0 0 1 of interest, ' is a nuisance parameter (i.e., common to all models), andH is a sharp point hypothesis nested within H . Suppose further that the prior for the nuisance parameter ' in H is equal to the 1 0 6 prior for ' in H after conditioning on the restriction – that is, p(' j H ) = p(' j  =  ;H ). Then 1 0 0 1 p( =  j y;H ) 0 1 B = : p( =  j H ) 0 1 Proof: By definition, the Bayes factor is the ratio of marginal likelihoods over H and H , respec- 0 1 tively. That is, p(y j H ) B = : (3.1) p(y j H ) The key idea in the proof is that we can use a “change of variables” technique to express B entirely in terms ofH . This proceeds by first unpacking the marginal likelihood forH over the nuisance pa- 1 0 rameter ' and then using the fact thatH is a sharp hypothesis nested withinH to rewrite everything 0 1 in terms of H . Specifically, p(y j H ) = p(y j ';H ) p(' j H )d' 0 0 0 = p(y j ';  =  ;H ) p(' j  =  ;H )d' 0 1 0 1 = p(y j  =  ;H ): 0 1 By Bayes’ Theorem, we can rewrite this last line as p( =  j y;H ) p(y j H ) 0 1 1 p(y j  =  ;H ) = : 0 1 p( =  j H ) 0 1 Thus we have p(y j H ) 1 B = = p(y j H ) 01 0 p(y j H ) p(y j H ) 1 1 = p(y j  =  ;H ) 0 1 p(y j H ) p( =  j y;H ) p(y j H ) 1 0 1 1 p( =  j H ) p(y j H ) 0 1 1 p( =  j y;H ) 0 1 = : p( =  j H ) 0 1 The beauty of Proposition 1 is that it allows one to calculate the Bayes factor for a point null hypothesis (i.e, H :  = 0) by simply computing two densities: (1) the density of  = 0 in the posterior, and (2) the density of  = 0 in the prior. Then, the Bayes factor results by taking the ratio of these posterior and prior densities, respectively. Given this result, this changes the problem of computing Bayes factors from one of integration (e.g, the JZS Bayes factor) to one of estimating prior and posterior densities. 7 3.1. Computing the Savage-Dickey density ratio The Savage-Dickey density ratio is an elegant solution to the problem of computing Bayes factors in situations involving a point null hypothesis H . All that is required is that one can compute samples from the posterior of an e ect size parameter under a specified alternative model H . I think casting the problem in the context is preferable, not only for its flexibility, but especially given the broad class of computer methods now available for sampling posteriors in Bayesian models, including BUGS (Lunn et al., 2000), JAGS (Plummer, 2003), and Stan (Carpenter et al., 2017). I will now focus on one recent addition to this collection – Greta (Golding, 2018). Greta is an R package designed for sampling from Bayesian models. It provides a reasonably simple language for modeling that is implemented directly within R, eliminating the need for writ- ing models in another language (e.g., JAGS, Stan) and then having to call these external files from within R. Further, Greta uses the computational power of Google TensorFlow (Abadi et al., 2015), so it provides fast convergence based on Hamiltonian Monte Carlo sampling (Neal, 2011), it scales well to very large datasets, and it can even be configured to run on GPUs, providing the ability for massive parallel computation. Moreover, it is a free download from the Comprehensive R Archive 1; Network (CRAN) , and as such can be installed directly from within R by typing the command install.packages(``greta'') at the R console. Note that fitting models with Greta will require the user to have a working installation of Python packages for TensorFlow (version 1.10.0 or higher) and tensorflow-probability (version 0.3.0 or higher). Once Greta is installed, the startup message will provide the user with system-specific instructions on how to install these two packages. While this step can be tricky, most errors can be addressed by following the recommendations on the Greta help 2; 3; page and the TensorFlow help page . To illustrate how Greta works, we will first look at a model inspired by that which was initially described by Wetzels et al. (2009) as an alternative to the JZS t-test. As is common in these types of models, the model is depicted as a graphical model (Gilks et al., 1994) in Figure 1. In such graphical models, we use the various nodes to represent all variables of interest. Dependencies between these variables are indicated with graph structure. Deterministic nodes are denoted as rhombuses (i.e., rotated squares), whereas stochastic nodes are represented by unshaded circles. Finally, we denote observed variables by shaded nodes. In this model, our data y is assumed to be drawn from a normal distribution with mean  and variance  . As in the discussion of the JZS t-test above, we consider e ect size  = = as our main parameter of interest. For a fully Bayesian specification, we must place priors on  and . For this first example, we follow Rouder et al. (2009) and Wetzels et al. (2009) and adopt their recommendations of placing a half-Cauchy prior on  (that is, one of the symmetric halves of the Cauchy(0,1) distribution that is defined for positive numbers only). Critically, we assume that e ect size  is distributed as Cauchy(0,1); the scale value of 1 indicates that, a priori, we believe that 50% of our e ect sizes would lie between -1 and 1. Of course, as we’ll see below, if this doesn’t reflect the analyst’s prior belief about , this prior can be easily changed. This choice of model allows us to define two competing hypotheses: H :  = 0 H :   Cauchy(0; 1) https://CRAN.R-project.org/package=greta https://greta-stats.org/articles/get started.html https://tensorflow.rstudio.com/tensorflow/articles/installation.html 8 δ ∼ Cauchy(0, 1) μ = δσ σ ∼ Cauchy(0, 1) y ∼ Normal(μ, σ ) Figure 1: A graphical model for a posterior sampling Bayesian t-test. Table 1: Example data for a single-sample t test Patient 1 2 3 4 5 6 7 8 9 10 Hours gained 0.7 -1.1 -0.2 1.2 0.1 3.4 3.7 0.8 1.8 2.0 AsH is a sharp hypothesis nested withinH , we can apply Proposition 1 (the Savage-Dickey density 0 1 ratio) and compute p( = 0 j y;H ) B = : p( = 0 j H ) To do this, we’ll need to draw samples from the posterior distribution of  under H and estimate the height of  = 0 in an estimated density function for this posterior. All of this can be done in R, as I will now illustrate. To begin, let us consider a simple example, the type of which can be found in most elementary statistics textbooks. This example comes from Hoel (1984). Suppose that 10 patients take part in an experiment on a new drug that is supposed to increase sleep in the patients. Table 1 shows the hours of sleep gained by each patient (negative values indicate lost sleep). Assuming that the sample data are normally distributed with mean  and variance  , we can test the hypothesis H :  = 0 against H :  , 0. The R code necessary to perform a Bayesian t-test on this data is displayed in Listing 1. The first step will be to load the Greta library (see line 2). After this, we need to assign our sample data to a vector y and then convert these to z-scores (see lines 5-6). The next step is to define the prior distributions on  and . The Greta syntax allows this to be done in a quite straightforward manner (see lines 9-10). Further, any deterministic operations should then be defined, as we do in line 13. Then, we can define our likelihood for the z-scores. The wording of the syntax has a nice advantage here, as it describes exactly what we are assuming about our scores; namely, that they are normally distributed with mean  and variance  (see line 16). The last step in setting up the model is to define the model; that is, we collect all of the variables of interest in our analysis. We have three: , , and , which we collect together in line 19. Now we are finally ready to sample from the posterior distributions of the variables in our model. We will focus our interest on delta, but Greta will automatically sample all posteriors for us. This step, displayed in line 22, will take a little while, depending on computing resources. Once the sampling is complete, there are two ways to inspect the samples before proceeding to our inference. The first is to type summary(draws); this will show us various descriptive statistics 9 Listing 1: Building and sampling from the single-sample t-test model 1 # load libraries 2 library(greta) 4 # data from Hoel (1984) 5 y = c(0.7,-1.1,-0.2,1.2,0.1,3.4,3.7,0.8,1.8,2.0) 6 z = y/sd(y) 8 # priors 9 delta = cauchy(0,1) 10 sigma = cauchy(0,1,truncation = c(0,Inf)) 12 # operations 13 mu = delta*sigma 15 # likelihood 16 distribution(z) = normal(mu,sigma) 18 # define model 19 m = model(mu, sigma , delta) 21 # draw samples 22 draws = mcmc(m, n_samples=5000) of the samples, including mean, standard deviation, standard error, and quantiles. In our example, there will be three lines of output; one for each of , , and . Another way to look at the samples is to inspect the path of the samples over time as they explore the posterior distributions. This is done 4; by using the mcmc_trace command from the bayesplot package (Gabry and Mahr, 2018). If the samples converged appropriately, one should see the characteristic “hairy caterpillar” plot, indicating that the chains mixed well and truly randomly explored the posteriors. We will now look at how to compute Bayes factors necessary to compare the models H and H . 0 1 We will do this by computing the Savage-Dickey density ratio. Recall from Proposition 1 that in order to compute B , we simply need to compute the ordinate of  = 0 in the densities of the prior and posterior, and then take their ratio. We already know the density function for the prior, and it is implemented in R as the dcauchy function. However, since we are using samples to approximate the posterior, we need a way to estimate its density function from the samples. One such method is to use a logspline density estimator (Kooperberg and Stone, 1992; Stone et al., 1997), which is implemented 5; by the function logspline from the R package polspline (Kooperberg, 2018). Listing 2 shows the R code necessary to both (1) plot the ordinates from the prior and posterior densities for  = 0 under H , and (2) compute BF as the ratio of these ordinates. The first step is 1 01 to extract the relevant samples from the posterior for  from the object draws (see line 4). Note that if the analyst is interested in other parameters (e.g.,  or , the [,3] part of line 4 can be adjusted appropriately. We can then perform the logspline estimate of the posterior density for , as shown on line 5. https://CRAN.R-project.org/package=bayesplot https://CRAN.R-project.org/package=polspline 10 Table 2: Example data for a two-sample t-test. Raw 62 60 56 63 56 63 59 56 44 61 Roasted 57 56 49 61 55 61 57 54 62 58 From here, there are two paths worth exploring. First, I typically will produce a plot showing the components of the Savage-Dickey density ratio. Such a plot will consist of (1) a plot of the prior density for  under H (the Cauchy(0,1) distribution); (2) a plot of the posterior density (from the logspline estimate); and (3) the ordinates of  = 0 in both of these densities. Lines 7-21 will produce such a graph, which can be seen in Figure 2. Next, we can compute the Savage-Dickey density ratio using the code on lines 24-30. Lines 24-25 compute the specific ordinates required. posterior represents the ordinate of  = 0 in the posterior distribution. Since the posterior was estimated from the logspline function, we call this estimate for our calculation using dlogspline along with the object name of our estimate from line 5 (i.e, fitPost). prior represents the ordinate of  = 0 under H ; computing this uses a simple call to the dcauchy function. Finally, we can divide posterior by prior to compute B , which is denoted in line 26 by BF01. Lines 29 and 30 then simply display both B (the Bayes factor in favor of H 01 0 over H ) and B (the Bayes factor in favor of H over H ). From this computation, one can see that 1 10 1 0 we have moderate support for H , as B  0:4. The intuition for this can be had by looking at how 1 01 density at  = 0 changes from prior to posterior. In Figure 2, the posterior density of  = 0 decreases relative to the prior density, indicating that our belief in a null e ect decreases after observing data x. Equivalently, we can use the reciprocal to compute B  2:6, indicating that our data are 2.6 times more likely under the alternative model H compared to the null model H , giving us positive 1 0 evidence in favor of a nonzero e ect . Now, suppose the analyst had a di erent a prior belief about the e ect sizes he or she would expect in this context. For illustration, let us suppose that  is normally distributed with mean  = 0 and variance  = 0:3, as recommended by Killeen (2007). In this case, all of the above code could be run again with some minor changes. First, one would need to change line 9 in Listing 1 to delta = normal(0,sqrt(0.3)). Also, any line in Listing 2 that uses dcauchy(x,0,1) would need to be replaced by dnorm(x,0,sqrt(0.3)). After this minor change, the resulting Bayes factor would be B  0:25, or equivalently, B  4:0. Note that this Bayes factor is a bit larger than the 01 10 previous one in which the Cauchy prior was used for . The reason folllows from the fact that the Cauchy prior is more dispersed relative to a normal prior, and thus with a normal prior, we have a relatively greater prior mass on smaller e ects. Particularly, the ordinate of  = 0 is larger in the normal prior compared to the Cauchy prior. The result is that the ratio between the ordinates of  = 0 in the prior and posterior becomes larger for the normal prior, thus giving us a larger Bayes factor. 3.2. Using the Savage-Dickey density ratio for a two-sample design The methods described above will readily scale up to problems involving two independent samples. All that is required is that the underlying model is adjusted accordingly. I will illustrate this with another example from Hoel (1984). Consider a sample of 20 rats, each of which receives their main source of protein from either raw peanuts or roasted peanuts. To compare weight gains as a function of protein source, a researcher randomly assigns 10 rats to receive only raw peanuts and 10 rats to receive only roasted peanuts. The resulting weight gains (in grams) are displayed in Table 2. First, we must consider the underlying model. As with the single-sample example, we can repre- sent this model as a directed acyclic graph, which is shown in Figure 3. In this model (inspired by 11 Listing 2: Plotting and computing the Savage-Dickey density ratio 1 Library(polspline) 3 # extract draws from MCMC object and fit a density estimate 4 posteriorDelta = draws[[1]][,3] 5 fitPost = logspline(posteriorDelta) 7 x = seq(-2,2,length.out=1000) 9 # plot density of prior and posterior together 10 plot(x, dlogspline(x, fitPost), type="l", main="", xlab="delta", 11 ylab="density", xlim=c(-2,2), lwd=2, lty=1) 13 # add prior 14 lines(x, dcauchy(x,0,1), lwd=2, lty=3) 16 # add points at 0 for both prior and posterior 17 points(0, dlogspline(0, fitPost), pch=19) 18 points(0, dcauchy(0,0,1), pch=19) 20 legend(-2,0.8, legend=c("Prior density","Posterior density"), 21 lty=c(3,1), lwd=c(2,2), bty="n") 23 # compute SD density ratio 24 posterior <- dlogspline(0, fitPost) 25 prior <- dcauchy(0,0,1) 26 BF01 <- posterior/prior 28 # display both Bayes factors 29 BF01 30 1/BF01 Wetzels et al., 2009), both independent samples x and y are assumed to be drawn from two normal distributions with shared variance  . The mean of the parent distribution of x is  + =2 and the mean for the parent distribution of y is  =2. With this parameterization, represents the “e ect” or di erence between the two populations. As with the single-sample example, we then scale this ef- fect to a standardized e ect  = =. Also, standard Cauchy priors are placed on , , and a truncated Cauchy prior is placed on . For concreteness, let us denote the sample of weight gains from the raw peanut diet as x and the weight gains from roasted peanuts as y. Given the model in Figure 3, our goal is to sample from the posterior distribution of . The R code necessary to perform this sampling is displayed in Listing 5. The procedure is similar to the single-sample model in Listing 1, but there are some notable modifications that are particular to the independent samples model. First, we need to rescale the raw data vectors x and y to z-scores. Since we are assuming shared variance, it suces to base both z- score transformations on only one of x and y. In lines 9-10, I have chosen to base the z-scores on x, but note that similar results will be obtained if instead the researcher chooses to base all z-scores on y. Lines 13-15 define the priors that we assigned to the parameters in our model. Lines 18-24 reflect our 12 Prior density Posterior density −2 −1 0 1 2 delta Figure 2: A plot showing the necessary components for computing the Savage-Dickey density ratio. Included are the density of the prior for  under H (i.e., a Cauchy(0,1) distribution, depicted as a dashed line) as well as the logspline density estimate for the posterior of  (depicted as a solid line). The Bayes factor B can be computed as the ratio of the ordinates of  = 0 under the posterior and prior, respectively. δ ∼ Cauchy(0, 1) μ ∼ Cauchy(0, 1) α σ σ ∼ Cauchy(0, 1) α = δσ x ∼ Normal(μ + α/2, σ ) x 2 y ∼ Normal(μ − α/2, σ ) Figure 3: A graphical model for the posterior sampling independent samples t-test. assumption that data x and y are randomly drawn from two normal distributions centered at  + =2 and  =2, respectively. In lines 27 and 30 we tell Greta to pull 5000 samples from the posterior distribution of ; note that for simplicity, I have only included delta in the model, though one could add any other variable in the model if desired. These posterior samples are then extracted into the density 0.0 0.2 0.4 0.6 0.8 1.0 1.2 13 Listing 3: Building and sampling from the independent-samples t-test model 1 # load libraries 2 library(greta) 4 # data from Hoel (1984) 5 x = c(62, 60, 56, 63, 56, 63, 59, 56, 44, 61) 6 y = c(57, 56, 49, 61, 55, 61, 57, 54, 62, 58) 8 # rescale so that x has mean=0 and sd=1 9 zx = (x-mean(x))/sd(x) 10 zy = (y-mean(x))/sd(x) 12 # priors 13 delta = cauchy(0,1) 14 mu = cauchy(0,1) 15 sigma = cauchy(0,1,truncation = c(0,Inf)) 17 # operations 18 alpha = delta*sigma 19 mux = mu + alpha/2 20 muy = mu - alpha/2 22 # likelihood 23 distribution(zx) = normal(mux,sigma) 24 distribution(zy) = normal(muy,sigma) 26 # define model 27 m = model(delta) 29 # MCMC sample 30 draws = mcmc(m, n_samples = 5000) 32 # extract draws from MCMC object 33 posteriorDelta = draws[[1]][,1] vector posteriorDelta in line 33. After completing the code in Listing 5, the Savage-Dickey density ratio can be plotted and com- puted as we did earlier in Listing 2. Figure 4 shows this ratio graphically; indeed, note that the poste- rior density of  = 0 increases relative to the prior density. This ratio is computed to be B = 2:92, indicating that the data are 2.92 times more likely under H than under H . Thus, we can conclude 0 1 positive support for a null e ect of peanut type on rats’ weight gain. 4. Extension: using encompassing priors for inequality constraints In the previous section, I described an extension of the JZS t-test that uses MCMC sampling to ap- proximate the posterior distribution of e ect size . This method works for sharp hypotheses (i.e., a point null, such as H :  = 0) by employing the Savage-Dickey density ratio, which reduces the calculation of the Bayes factor B into a simple ratio based on the ordinates of the point  = 0 in both 01 14 Prior density Posterior density −2 −1 0 1 2 delta Figure 4: A plot showing the necessary components for computing the Savage-Dickey density ratio in the independent-samples model. Included are the density of the prior for  under H (i.e., a Cauchy(0,1) distri- bution, depicted as a dashed line) as well as the logspline density estimate for the posterior of  (depicted as a solid line). The Bayes factor B can be computed as the ratio of the ordinates of  = 0 under the posterior and prior, respectively. the prior and posterior distributions for . Consider again the sleep example above. What if instead the researcher wanted to whether the new drug increased sleep in patients? This would require the ability to test a directional hypotheses H :  > 0 against H :   0. At first glance, this seems like quite a di erent problem, as the 1 0 Savage-Dickey density ratio does not directly apply to models with inequality constraints. However, there is a method due originally to Klugkist et al. (2005) that fits with this type of problem. In their approach, Klugkist et al. cast such problems as one of testing models with inequality constraints nested within an encompassing model. In this context, both hypotheses H and H are considered as 0 1 specific inequality constraints nested within an encompassing modelH : , where  is unconstrained (i.e.,  2 R). The Klugkist et al. approach (which I will hereafter call the encompassing approach) amounts to using MCMC samples to calculate p(y j H ) B = 0e p(y j H ) and p(y j H ) B = : 1e p(y j H ) density 0.0 0.2 0.4 0.6 0.8 1.0 15 Once these two Bayes factors are computed, one can use transitivity of Bayes factors to compute B = B  B = B  : 01 0e e1 0e 1e The mechanics of the encompassing approach can be summarized in the following proposition: Proposition 2. Consider two models H and H , where H is nested within an encompassing model 1 e 1 H via an inequality constraint on some parameter , and  is unconstrained under H . Then e e c 1=d B = = 1e d 1=c where 1=d and 1=c represent the proportions of the posterior and prior of the encompassing model, respectively, that are in agreement with the inequality constraint imposed by the nested model H . Proof: Consider first that for any modelH on data y with parameter vector , Bayes’ theorem implies f (y j ;H ) p( j H ) t t p( j y;H ) = : p(y j H ) Thus, we can write the marginal likelihood for y under H as f (y j ;H ) p( j H ) t t p(y j H ) = : p( j y;H ) Taking the ratio of the marginal likelihoods for H and the encompassing model H yields the fol- 1 e lowing Bayes factor: f (y j ;H ) p( j H )= p( j y;H ) 1 1 1 B = : 1e f (y j ;H ) p( j H )= p( j y;H ) e e e Now, both the constrained model H and the encompassing model H contain the same parameters 1 e . Choose a specific value of , say  , that exists in both models H and H (we can do this because 1 e 0 0 0 H is nested within H . Then, for this parameter value  , we have f (y j  ;H ) = f (y j  ;H ), so 1 e 1 2 the expression for the Bayes factor reduces to an expression involving only the priors and posteriors for  under H and H : 1 e 0 0 p( j H )= p( j y;H ) 1 1 B = : 1e 0 0 p( j H )= p( j y;H ) e e BecauseH is nested withinH via an inequality constraint, the prior p( j H ) is simply a truncation 1 e 1 0 0 of the encompassing prior p( j H ). Thus, we can express p( j H ) in terms of the encompassing e 1 prior p( j H ) by multiplying the encompassing prior by an indicator function over H and then e 1 normalizing the resulting product. That is, p( j H ) I e  2H p( j H ) = 0 0 p( j H ) I 0 d e  2H 2H = R  p( j H ); 0 0 p( j H ) I 0 d e  2H 1 16 where I 0 is an indicator function. For parameters  2 H , this indicator function is identically 2H 1 equal to 1, so the expression in parentheses reduces to a constant, say c, allowing us to write 0 0 p( j H ) = c p( j H ): 1 e By similar reasoning, we can write the posterior as I 0 2H 0 0 0 p( j y;H ) =  p( j y;H ) = d  p( j y;H ): 1 e e 0 0 p( j y;H )I 0 d e  2H This gives us 0 0 c p( j H )=d  p( j y;H ) c 1=d e e B = = = : 1e 0 0 p( j H )= p( j y;H ) d 1=c e e Note that by definition, 1=d represents the proportion of the posterior distribution for  under the encompassing model H that agrees with the constraints imposed by H . Similarly, 1=c represents e 1 the proportion of the prior distribution for  under the encompassing model H that agrees with the constraints imposed by H . It might seem a bit odd to represent the fraction c=d in the form (1=d)=(1=c). However, this is again done for a computational advantagem, as we can use MCMC sampling to easily estimate the proportions 1=d and 1=c. Also note that in some sense, the encompassing prior approach of Klugkist et al. (2005) is a generalized version of the Savage-Dickey density ratio. Indeed, Wetzels et al. (2010) proved that under “about equality” constraints (e.g., a constrained model M : " <  < " for " > 0), the Bayes factor derived from the encompassing approach tends toward the Bayes factor (for the point null where  = 0) obtained from the Savage-Dickey density ratio as " ! 0. 4.1. Computing Bayes factors with the encompassing approach To illustrate the computation of Bayes factors with the encompassing approach, let us consider the problem mentioned immediately above – suppose we wanted to test whether the drug that we ad- ministered to sleep patients actually increased the patients’ sleep. Specifically, we wish to compare H :   0 against H :  > 0. We will do this by considering both H and H as models with 0 1 0 1 inequality constraints nested with an encompassing model H : , where  is unconstrained. Then, we can use transitivity to compute B = B  B = B =B . 10 1e e0 1e 0e The R code necessary to perform these computations is in Listing 4. As the encompassing model is defined identical to that from Figure 1, the code assumes that we have already drawn samples from that model, as we did in Listing 1. Note that just like our previous computations with the Savage- Dickey density ratio, using the encompassing approach requires that we sample from the posterior of under the unconstrained, encompassing model H . Though the notation is di erent, this is exactly the same posterior distribution that we sampled from in Listing 1. The key steps in Listing 4 are as follows. First, we will compare H to the encompassing model H . To this end, we need to compute the proportion of posterior samples from the encompassing model that are in agreement with the inequality constraint imposed by H (this is the quantity 1=d in the proof of Proposition 2). We say that such samples are “evidential” of H . The R code that will compute this proportion is in line 7. Then, we need to compute the proportion of evidential samples in the prior (i.e., 1=c). Since the prior has known density   Cauchy(0; 1), we can use the pcauchy command to directly compute the proportion of values  that are less than 0; this computation proceeds 17 Listing 4: Computing a Bayes factor for a directional hypothesis using the encompassing approach 1 # directional hypothesis using encompassing priors 2 # H_0: "null" model: delta <= 0 3 # H_1: "alternative" model: delta > 0 4 # H_e: "encompassing" model: delta ~ Cauchy (unconstrained) 6 # H_0 versus H_e 7 postEvidential = mean(posteriorDelta <=0) 8 priorEvidential = pcauchy(0,0,1) 9 BF0e = postEvidential/priorEvidential 11 # H_1 versus H_e 12 postEvidential = mean(posteriorDelta >0) 13 priorEvidential = 1-pcauchy(0,0,1) 14 BF1e = postEvidential/priorEvidential 16 # H_1 versus H_0 17 BF10 = BF1e/BF0e; BF10 in line 8. Then, by Proposition 2, we can simply divide these two quantities to compute B (see line 0e 9). Next, we do a similar computation with H versus the encompassing model H , shown in lines 1 e 12-14. This gives us a value for B . Now, we can compute the Bayes factor for H over H by 1e 1 0 computing B =B  65, indicating that the observed data are approximately 65 times more likely 1e 0e under the alternative model H :  > 0 compared to the null model H :   0. 1 0 This approach can be extended to test a wide variety of hypotheses involving inequality con- straints. One particular advantage of the encompassing approach is that it gives us the ability to test interval null hypotheses – that is, hypotheses of the form H : " <  < ". To illustrate, consider the analyst who is not interested in whether an e ect is exactly 0, but rather, is interested in whether an e ect is larger than threshold, say " = 0:2. An example of such computation is displayed in Listing 5. Like in the example above, we define three hypotheses: two competing hypotheses H : jj < " and H : jj > ", both nested within an 0 1 encompassing modelH : . In the example, I have set " = 0:20, but one can set this value at whatever value seems reasonable for the given context. As in the example before, we use the posterior samples for  under H that were generated in Listing 1 to calculate the proportions of the posterior that satisfied the inequality constraints on imposed byH andH . These computations are performed in lines 9 and 14. The relevant proportions 0 1 of the unconstrained prior that obey the imposed inequality constraints are again calculated using the pcauchy command, as seen in lines 10 and 15. Finally, lines 11,16, and 19 calculate the relevant Bayes factors. As we can see, the Bayes factor B  2:2, indicating that the observed data are 2.2 times more likely under the model H : jj > " compared to the model H : jj < ". Notice that this 1 0 is similar to, but less than, the Bayes factor obtained with the point null hypothesis H :  = 0 from earlier in the paper. 18 Listing 5: Computing a Bayes factor for an interval null hypothesis using the encompassing approach 1 # interval null via encompassing priors 2 # H_0: "null" model: |delta| < epsilon 3 # H_1: |delta| > epsilon 4 # H_e: delta ~ Cauchy (unconstrained) 6 epsilon = 0.2 8 # H_0 versus H_e 9 postEvidential = mean(abs(posteriorDelta) < epsilon) 10 priorEvidential = pcauchy(epsilon ,0,1)-pcauchy(-epsilon ,0,1) 11 B_0e = postEvidential/priorEvidential 13 # H_1 versus H_e 14 postEvidential = mean(abs(posteriorDelta) > epsilon) 15 priorEvidential = (1-pcauchy(epsilon ,0,1)) + pcauchy(-epsilon ,0,1) 16 B_1e = postEvidential/priorEvidential 18 # H_1 versus H_0 19 B_10 = B_1e/B_0e; B_10 5. Conclusions In this tutorial, I have demonstrated a flexible approach to extending the default JZS t-test, a Bayesian test that is becoming increasingly popular in the social and behavioral sciences (Rouder et al., 2009). The approach uses two theoretical results, the Savage-Dickey density ratio (Dickey and Lientz, 1970) and the method of encompassing priors (Klugkist et al., 2005) in combination with an easy-to-use probabilistic modeling package for R called Greta (Golding, 2018). Though the examples presented in this paper are quite trivial to implement, they provide the reader with a general workflow that can be extended to solve problems relevant to his or her own work. Inherent in the techniques presented here is flexibility; the user has complete freedom to specify the underlying models and specific model comparisons in any way that he or she wishes. Finally, the Greta modeling language is easy to learn and readily extends to more complex modelsy. Furthermore, by harnessing the power of Google Tensorflow (Abadi et al., 2015), the MCMC sampler is fast, with all models described in the paper converging in less than one minute. In summary, I think this is an advantageous approach to using de- fault Bayesian tests for common hypothesis testing scenarios, especially those common in the social, behavioral, and other applied sciences. Acknowledgement I am grateful to the handling editor and two anonymous referees for their comments on an earlier version of this manuscript. References Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, 19 R., Kaiser, L., Kudlur, M., Levenberg, J., Mane, ´ D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Vieg ´ as, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org. Carpenter, B., Gelman, A., Ho man, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1). Dickey, J. M. and Lientz, B. P. (1970). The weighted likelihood ratio, sharp hypotheses about chances, the order of a markov chain. The Annals of Mathematical Statistics, 41(1):214–226. Faulkenberry, T. J. (2018). Computing Bayes factors to measure evidence from experiments: An extension of the BIC approximation. Biometrical Letters, 55(1):3143. Gabry, J. and Mahr, T. (2018). bayesplot: Plotting for Bayesian Models. R package version 1.6.0. Gelfand, A. E. and Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410):398–409. Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5):587–606. Gilks, W. R., Thomas, A., and Spiegelhalter, D. J. (1994). A language and program for complex bayesian modelling. The Statistician, 43(1):169–177. Golding, N. (2018). greta: Simple and Scalable Statistical Modelling in R. R package version 0.3.0.9001. Hoekstra, R., Morey, R. D., Rouder, J. N., and Wagenmakers, E.-J. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21(5):1157–1164. Hoel, P. G. (1984). Introduction to Mathematical Statistics. John Wiley & Sons, New York, 5th edition. JASP Team (2018). JASP (Version 0.9)[Computer software]. Je reys, H. (1961). The Theory of Probability (3rd ed.). Oxford University Press, Oxford, UK. Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430):773–795. Killeen, P. R. (2007). Replication statistics as a replacement for significance testing: Best practices in scientific decision-making. In Osborne, J. W., editor, Best practices in quantitative methods, pages 103–124. SAGE Publications, Inc., Thousand Oaks, CA. Klugkist, I., Kato, B., and Hoijtink, H. (2005). Bayesian model selection using encompassing priors. Statistica Neerlandica, 59(1):57–69. Kooperberg, C. (2018). polspline: Polynomial Spline Routines. R package version 1.1.13. Kooperberg, C. and Stone, C. J. (1992). Logspline density estimation for censored data. Journal of Computational and Graphical Statistics, 1(4):301–328. 20 Lunn, D. J., Thomas, A., Best, N., and Spiegelhalter, D. (2000). WinBUGS-a bayesian modelling framework: concepts, structure, and extensibility. Statistics and computing, 10(4):325–337. Masson, M. E. J. (2011). A tutorial on a practical Bayesian alternative to null-hypothesis significance testing. Behavior Research Methods, 43(3):679–690. Morey, R. D. and Rouder, J. N. (2011). Bayes factor approaches for testing interval null hypotheses. Psychological Methods, 16(4):406–419. Morey, R. D. and Rouder, J. N. (2018). BayesFactor: Computation of Bayes Factors for Common Designs. R package version 0.9.12-4.2. Neal, R. (2011). Mcmc using hamiltonian dynamics. In Brooks, S., Gelman, A., Jones, G., and Meng, X., editors, Handbook of Markov Chain Monte Carlo, pages 116–162. Chapman and Hall/CRC. Oakes, M. (1986). Statistical Inference: A commentary for the social and behavioural sciences. John Wiley & Sons, Chicester. Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sam- pling. R Core Team (2018). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25:111–163. Richard, F. D., Bond, C. F., and Stokes-Zoota, J. J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7(4):331–363. Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., and Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2):225–237. Stone, C. J., Hansen, M. H., Kooperberg, C., and Truong, Y. K. (1997). Polynomial splines and their tensor products in extended linear modeling: 1994 Wald memorial lecture. The Annals of Statistics, 25(4):1371–1470. Wagenmakers, E., Wetzels, R., Borsboom, D., and van der Maas, H. L. J. (2011). Why psychologists must change the way they analyze their data: The case of psi: Comment on Bem (2011). Journal of Personality and Social Psychology, 100(3):426–432. Wagenmakers, E.-J., Lodewyckx, T., Kuriyal, H., and Grasman, R. (2010). Bayesian hypothesis test- ing for psychologists: A tutorial on the SavageDickey method. Cognitive Psychology, 60(3):158– Wang, M. (2017). Mixtures of g -priors for analysis of variance models with a diverging number of parameters. Bayesian Analysis, 12(2):511532. Wetzels, R., Grasman, R. P., and Wagenmakers, E.-J. (2010). An encompassing prior generalization of the savagedickey density ratio. Computational Statistics & Data Analysis, 54(9):20942102. 21 Wetzels, R., Raaijmakers, J. G. W., Jakab, E., and Wagenmakers, E.-J. (2009). How to quantify support for and against the null hypothesis: A flexible WinBUGS implementation of a default bayesian t test. Psychonomic Bulletin & Review, 16(4):752–760. Zellner, A. and Siow, A. (1980). Posterior odds ratios for selected regression hypotheses. Trabajos de Estadistica Y de Investigacion Operativa, 31(1):585–603. Appendix: In this appendix, I report a simulation study designed to benchmark performance of the posterior sampling generalization of the JZS t-test described in this paper against the version of the JZS test originally proposed by Rouder et al. (2009). For each simulation, I randomly generated 200 single-sample data sets of size N (where N = 20, 50, or 80) under the model y =  +" . For each of these data sets, di erent “e ects” were represented i i by varying the parameter , which was assumed to be drawn randomly from a normal distribution with mean 0 and variance g (where g = 0, 0.05, or 0.2; also see Wang, 2017; Faulkenberry, 2018). Each of the errors " for a given data set was drawn from a normal distribution with mean 0 and variance 1. The resulting combinations of 3 di erent sample sizes (N = 20; 50; 80) and 3 di erent e ect parameters (g = 0; 0:05; 0:2) produced a total of nine simulations. Once a simulated data set was constructed, I computed the Bayes factor B for the null hypothesis H :  = 0 over the alternative hypothesisH :  = 1 using two methods: (1) the JZS Bayes factor of 0 1 Rouder et al. (2009), and (2) the posterior sampling technique. The JZS Bayes factor was computed using the ttestBF function from the BayesFactor package in R, and the posterior sampling Bayes factor was computed using the methods described in this paper in Section 4.1 (i.e., drawing posterior samples using Greta, fitting a logspline estimate of the posterior, and then computing the Savage- Dickey density ratio by comparing the ordinates of  = 0 in the posterior and prior, respectively). Each Bayes factor was computed using a Cauchy prior of scale r = 1. In all, I found the two methods to be quite comparable to each other. To see why, let’s first inspect the distributions of Bayes factors obtained for each of the nine combinations of N and g. Figure A.1 shows these via overlaid density plots of log(B ) for each computation method. As one can readily see in Figure A.1, the density plots have considerable overlap, indicating that both methods produced very similar distributions of Bayes factors. Further evidence for the compatibility of the two techniques comes from Table A.1, shows five- number summaries for the values of log(B ) obtained for each condition. Additionally, I computed model selection consistency, defined as the proportion of simulated data sets for which the JZS Bayes factor and the posterior sampling Bayes factor led to the same model choice. For this computation, model selection was determined by computing log(B ). If log(B ) > 0, thenH was selected; other- 01 01 0 wise,H was selected (see also Faulkenberry, 2018). As is shown in Table A.1, the posterior sampling technique again produced a distribution of Bayes factors that was very similar to those obtained from the JZS Bayes factor, mirroring what is shown in Figure A.1. Critically, both computation methods selected the same model in a large percentage of the simulated data sets, as indicated by the large consistency values in Table A.1. In all, the proposed sampling method for computing Bayes factors described in this tutorial seems to be quite consistent with the established, albeit less flexible, JZS Bayes factor of Rouder et al. (2009). Thus, the researcher can be confident that the posterior sampling methods described in this paper not only a ord a great deal of flexibility, but also benchmark well against other established methods of computation. 22 g=0, N=20 g=0, N=50 g=0, N=80 1.5 1.0 1.0 1.0 0.5 0.5 0.5 0.0 0.0 0.0 −2 −1 0 1 2 −2 −1 0 1 2 −4 −2 0 2 g=0.05, N=20 g=0.05, N=50 g=0.05, N=80 0.4 0.2 0.4 0.3 0.2 0.2 0.1 0.1 0.0 0.0 0.0 −4 −2 0 2 −8 −4 0 −6 −3 0 3 g=0.2, N=20 g=0.2, N=50 g=0.2, N=80 0.15 0.09 0.2 0.10 0.06 0.1 0.05 0.03 0.0 0.00 0.00 −20 −15 −10 −5 0 −25 −20 −15 −10 −5 0 −30 −20 −10 0 log(BF) log(BF) log(BF) Figure A.1: Density plots of log(B ) for each of the nine combinations of ”e ect” g (0,0.05,0.2) and sample size N (20,50,80). The JZS Bayes factor distribution is displayed as a solid line, whereas the distribution of posterior sampling Bayes factors is displayed as a dashed line. Table A.1: Five-number summary of log(B ) and model selection consistency. g N BF type Min Q Median Q Max Consistency 1 3 0 20 JZS -1.77 -1.71 -1.51 -1.12 1.76 sampling -1.95 -1.70 -1.51 -1.08 1.88 0.985 50 JZS -2.20 -2.16 -2.01 -1.56 1.51 sampling -2.47 -2.16 -1.99 -1.53 2.49 0.985 80 JZS -2.43 -2.38 -2.20 -1.69 4.21 sampling -2.55 -2.36 -2.18 -1.61 2.90 1.000 0.05 20 JZS -4.43 0.23 1.11 1.61 1.77 sampling -4.95 0.27 1.12 1.62 2.04 0.995 50 JZS -10.86 0.08 1.48 1.97 2.20 sampling -4.95 0.27 1.12 1.62 2.04 0.975 80 JZS -8.09 -0.79 1.14 2.06 2.43 sampling -6.96 -0.79 1.16 2.08 2.51 0.945 0.2 20 JZS -10.98 -1.76 0.39 1.44 1.77 sampling -19.90 -1.81 0.25 1.42 1.86 0.975 50 JZS -23.85 -4.25 0.05 1.51 2.20 sampling -12.39 -3.19 0.01 1.49 2.36 0.965 80 JZS -37.79 -6.78 -1.13 1.67 2.43 sampling -25.60 -4.77 -1.11 1.85 2.65 1.000 density density density http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Statistics arXiv (Cornell University)

A tutorial on generalizing the default Bayesian t-test via posterior sampling and encompassing priors

Statistics , Volume 2021 (1812) – Dec 7, 2018

Loading next page...
 
/lp/arxiv-cornell-university/a-tutorial-on-generalizing-the-default-bayesian-t-test-via-posterior-BKHP0Cfa0u

References (44)

ISSN
2383-4757
eISSN
ARCH-3347
DOI
10.29220/CSAM.2019.26.2.217
Publisher site
See Article on Publisher Site

Abstract

With the advent of so-called “default” Bayesian hypothesis tests, scientists in applied fields have gained ac- cess to a powerful and principled method for testing hypotheses. However, such default tests usually come with a compromise, requiring the analyst to accept a one-size-fits-all approach to hypothesis testing. Further, such tests may not have the flexibility to test problems the scientist really cares about. In this tutorial, I demonstrate a flexible approach to generalizing one specific default test (the JZS t-test; Rouder et al., 2009) that is becoming increasingly popular in the social and behavioral sciences. The approach uses two results, the Savage-Dickey density ratio (Dickey & Lientz, 1980) and the technique of encompassing priors (Klugkist et al., 2005) in com- bination with MCMC sampling via an easy-to-use probabilistic modeling package for R called Greta. Through a comprehensive mathematical description of the techniques as well as illustrative examples, the reader is presented with a general, flexible workflow that can be extended to solve problems relevant to his or her own work. Note: this paper is in press at Communications for Statistical Applications and Methods. Keywords: Bayes factors, Bayesian inference, hypothesis testing, MCMC sampling, JZS t-test, Savage-Dickey density ratio, encompassing priors 1. Introduction The t-test is one of the simplest, yet most enduring, examples of a hypothesis test that the social and behavioral scientist uses in his or her daily work. In the typical framework of null hypothesis significance testing (NHST), the t-test works by first assuming a null hypothesis, and then calcu- lating a t-score, which indexes the likelihood of obtaining some sample of observed data under the null hypothesis. If this probability is small, the scientist rejects the null in favor of some alternative hypothesis. Consider the following scenario that is often used when assessing the e ect of some treatment. Let th x and x denote measurements for the i participant in two di erent conditions (e.g., a pretest and i1 i2 posttest). Consider the di erence d = x x . A typical consideration is whether these di erences i i2 i1 are di erent from 0; answering this question in the armative would then imply that the treatment had some nonzero e ect. To answer this question, one can apply the standard one-sample t-test, which works by first assuming d  Normal(;  ); and then defining two competing hypotheses: a null hypothesis H :  = 0 and an alternative hypoth- esis H :  , 0. We then test H by computing 1 0 t = ; s= n Corresponding author: Assistant Professor, Department of Psychological Sciences, Tarleton State University, Stephenville, Texas, USA. E-mail: faulkenberry@tarleton.edu arXiv:1812.03092v2 [stat.CO] 5 Feb 2019 2 where d is the mean of the di erences d across all participants i = 1; : : : ; n, s is the sample standard deviation of the di erence scores d , and n is the sample size. Under the null H , the distribution of t i 0 is well known as Student’s t distribution, with density function 2 2 f (x) =   1 + ; where  represents degrees of freedom, and x 2 (1;1). The cumulative distribution function F(x) = f (u)du can then be used to index the probability of observing data at least as extreme as that which we observed under the null hypothesis H . Specifically, we compute p(jxj > t) = 1 F(t), a quantity commonly known as a p-value. If this probability is small (say, less than 5%), then one may decide to reject H and conclude that  , 0, thus implying that our treatment had some nonzero e ect. This idea is well known to practicing social and behavioral scientists. However, there are some properties of this procedure that are suboptimal for robust inference. For one, the procedure is asym- metric (Rouder et al., 2009). Suppose that one calculates a p-value above some commonly-used threshold like 5%. What decision does the researcher make? Surely the logical opposite of “reject H ” is “accept H ”. However, this decision rule is inconsistent. The reason for this follows from 0 0 examining the distribution of p-values that result from increasing sample sizes. When the null is false (i.e.,  ,  ), the value of t increases as sample sizes increase. Thus, the probability of rejecting H 1 2 0 increases accordingly. However, if the null is true, p-values are uniformly distributed between 0 and 1, regardless of sample size. So, whereas a false null hypothesis can always be rejected if sample size is large enough, a true null hypothesis is always susceptible to being incorrectly rejected. Such incon- sistency leads to asymmetry in the testing procedure – increasing sample size can increase evidence against a false null hypothesis, but there is no corresponding way to increase evidence for a true null hypothesis. Another criticism of the traditional hypothesis testing procedure is that researchers often misin- terpret the results of such tests. Hoekstra et al. (2014) asked 562 researchers and students from the field of psychology to assess the validity of six di erent statements involving incorrect interpretations of confidence intervals (e.g., “The probability that the true mean is greater than 0 is at least 95%”). Although each of these statements was false, both students and researchers on average believed at least 3 of the statements were true. Furthermore, researchers did no better than students with respect to these misunderstandings. This finding echos results by Oakes (1986), who performed a similar study using statements about p-values; see also Gigerenzer (2004). In light of these criticisms, the social and behavioral sciences have seen an increase in recommen- dations to find alternatives to the orthodox use of null hypothesis significance testing (Wagenmakers et al., 2011). One such alternative is to use Bayesian inference, and in particular Bayes factors (Kass and Raftery, 1995; Raftery, 1995; Masson, 2011). Bayesian inference is based on calculating the posterior probability of a hypothesis H after observing data y. This calculation proceeds by Bayes’ theorem, which states p(y j H ) p(H ) p(H j y) = : (1.1) p(y) One way to think of Equation 1.1 is as follows: prior to observing data, one assigns a prior belief p(H ) to a hypothesis H . Once the data y have been observed, one updates this prior belief to a posterior belief p(H j y) by multiplying the prior p(H ) by the likelihood p(y j H ). This product is then rescaled to meet the requirements for being probability distribution (i.e., total probability = 1) by dividing by p(y), the marginal probability of the observed data averaged across all possible hypotheses H . 3 While this computation is fundamentally quite basic, one immediate consequence is how it can be used for comparing two hypotheses. Suppose as above that we have two competing hypotheses: a null hypothesisH and an alternative hypothesisH . We can directly compare our posterior beliefs in 0 1 these two hypotheses by computing their ratio p(H j y)= p(H j y), which we call the posterior odds 0 1 for H over H . Using Bayes’ theorem (Equation 1.1), we can readily see 0 1 p(H j y) p(y j H ) p(H ) 0 0 0 =  : (1.2) p(H j y) p(y j H ) p(H ) 1 1 1 | {z } | {z } | {z } posterior odds Bayes factor prior odds As with Bayes’ theorem, Equation 1.2 can also be interpreted in terms of an “updating” metaphor. Specifically, the posterior odds ratio is equal to the prior odds ratio multiplied by an updating factor. This updating factor is the ratio of the marginal likelihoods p(y j H ) and p(y j H ), and is called 0 1 the Bayes factor (Je reys, 1961; Kass and Raftery, 1995). The Bayes factor is the weight of evidence provided by data y. For example, suppose that one assigned the prior odds of H to H to be equal to 0 1 4-to-1; that is, we believe that, a priori, H is 4 times more likely to be true than H . Then, suppose 0 1 that after observing data, we compute a Bayes factor was computed to be 5. Now, the posterior odds (the odds of H over H after observing data) is 20-to-1 in favor of H over H . 0 1 0 1 There are two immediate advantages to using the Bayes factor for inference. First, the Bayes factor is a ratio, and thus, is subject to a natural interpretation. Simply put, larger is better - the bigger the Bayes factor, the bigger the weight of evidence provided by the observed data. Second, since there was no specific assumption about the order in which we addressed H and H , we could have just as 0 1 easily measured the weight of evidence in favor of H over H . In fact, once we have a Bayes factor 1 0 in favor of one hypothesis, a simple reciprocal will give us the Bayes factor in favor of the the other hypothesis. In our example above, the Bayes factor for H over H would have been 1/5, implying 1 0 that the data would actually decrease our relative belief in H over H . Because we can compute 1 0 Bayes factors from either direction, we must be careful to define our notation carefully. In this paper, I will adopt the common convention to define B as the Bayes factor for H over H . Similarly, B 01 0 1 10 would represent the Bayes factor for H over H . Note that, by our discussion above, B = 1=B . 1 0 01 10 Though the previous discussion certainly speaks positively about the benefits of using the Bayes factor as a tool for inference, there are some important considerations that the researcher must address before implementing it as a tool for inference. First, as we’ll see in the discussion below, the Bayes factor requires the analyst to specify prior distributions on all parameters in the underlying model. Thus, a given Bayes factor reflects a specific choice of prior. Second, the computation of these Bayes factors is usually nontrivial. For example, computing the either the numerator or denominator of Equation 1.2 requires explicitly defining the hypothesisH as a model consisting of vectors  in some parameter space  and integrating the likelihood weighted by a prior distribution on these parameters; that is, p(y j H ) = f (y j ;H ) p( j H )d; i i i where f is the likelihood function, and p denotes the prior distribution on parameters  2  under model H . However, the last decade has seen the development of many tools intended to simplify the calculation of Bayes factors for the common models used by applied researchers, including online calculators, standalone software packages such as JASP (JASP Team, 2018), and a wide range of packages for the statistical computing environment R (R Core Team, 2018), including the package BayesFactor (Morey and Rouder, 2018). Because these solutions are designed to work across a variety 4 of contexts, one must necessarily assume some defaults with respect to the models that underly these calculators. Many have argued that these defaults represent reasonable assumptions about the types of problems with which many applied researchers are concerned. However, recent advances in statistical computing have made it easier for the practicing researcher to build his or her own custom models for a given situation and compute Bayes factors to compare these models. In this paper, I will provide a tutorial with particular focus on extending one type of Bayesian model comparison known as the Je reys-Zellner-Siow t-test (Rouder et al., 2009) (henceforth abbre- viated as JZS t-test). Specifically, I will describe a generalization that provides an adaptable, compu- tationally ecient method for computing Bayes factors in a variety of single-sample and independent- samples designs. The organization of the paper is as follows. First, I will describe the mathematical underpinnings of the JZS t-test. Then, I will present two results which allow us to generalize the JZS t-test to a broader class of model comparisons: (1) the Savage-Dickey density ratio (Dickey and Lientz, 1970; Wagenmakers et al., 2010; Wetzels et al., 2009), which is used for comparing models in which one is a sharp hypothesis (e.g, a point null hypothesis) nested within another unconstrained model; and (2) the encompassing prior technique (Klugkist et al., 2005), which is used for comparing nested models with ordinal constraints. Finally, I will demonstrate (with examples) how to use these techniques along with posterior sampling (Gelfand and Smith, 1990) to compute Bayes factors for model comparisons involving both point-null hypotheses (i.e., testing whether an e ect is exactly 0), directional hypotheses (i.e., testing whether an e ect is postive compared to whether it is negative), and interval-null hypotheses (i.e., testing whether an e ect is approximately 0). 2. The JZS t-test The JZS t-test (Rouder et al., 2009) was developed as a default Bayesian version of the orthodox t-test described above. We will denote by y a vector of observed data, and assume as above that y is normally distributed with mean  and variance  . We can then explicitly define two competing hypotheses: a null hypothesis H and an alternative hypothesis H . Both hypotheses can be parameterized with 0 1 two parameters,  and  . Under the alternative hypothesisH , we allow  and  to freely vary. That is, H is an unconstrained model;  could be positive, negative, or zero. For the null hypothesis H , 1 0 which states that the mean of data y is equal to 0, we can simply constrain  to be 0. Thus, we say that H is nested within H . Under these models, we can compute the Bayes factor B = p(y j H )= p(y j 0 1 01 0 H ), where 2 2 2 p(y j H ) = f (y j  = 0;  ;H ) p( ;H )d 0 0 0 and Z Z 1 1 2 2 2 p(y j H ) = f (y j ;  ;H ) p(;  ;H )d d: 1 1 1 1 0 2 2 These computations require placing priors on  under the null model H and both  and  under the alternative model H . Following Je reys (1961) and Zellner and Siow (1980), Rouder et al. reparameterized the problem by placing a Cauchy prior on e ect size  = =. That is, under H , Cauchy(0; r), where r represents the scale of expected e ect sizes, and underH ,  = 0. Rouder et 2 2 2 al. placed a Je rey’s prior on  ; specifically, p( ) / 1= . With these default prior specifications, Rouder et al. (2009) showed that the Bayes factor can be computed as t 2 1 + B = ; (2.1) 1 1 2 1 3 t 1 2 2 2 (1 + Ngr ) 1 + (2) g exp dg 2g 0 (1+Ngr ) 5 where t is the orthodox t statistic, r is the scale on the e ect size prior, N is the number of observations in y, and  = N 1 denotes the degrees of freedom. Though computationally convenient, this JZS Bayes factor formula has a few disadvantages. First, it reflects a very specific choice of prior specification. Though using a Cauchy prior on e ect size may be a reasonable choice for many researchers, especially in the behavioral sciences (Rouder et al., 2009), others may argue for a di erent prior. For example, Killeen (2007) used meta-analytic data from Richard et al. (2003) to argued that e ect sizes in social psychology are typically normally distributed with variance equal to 0.3. Certainly, one advantage of a Bayesian approach is that the prior on e ect size should reflect the analyst’s prior belief on what e ect sizes should be expected. For some fields of study, these e ect sizes may be expected to be small (i.e., social psychology), whereas in other fields, these e ect sizes may be expected to be larger. Also, note that the reason for using the Cauchy prior (instead of a normal prior) in the JZS Bayes factor is one of computational convenience; it simply makes the computation work out. If the analyst wants to use a di erent prior, the formula for the JZS Bayes factor in Equation 2.1 would have to be recomputed, a task which would only be accessible to those researchers with the appropriate mathematical background. Another disadvantage of the JZS Bayes factor is that it forces the analyst into a very specific hypothesis test; that is, H :  = 0 versus H :  , 0. One may be interested instead in more flexible 0 1 testing situations – for example, testing a directional hypothesis H :  > 0 versus H :   0, or 0 1 an interval hypothesis such as H : " <  < " versus an alternative model H : jj > " (Morey 0 1 and Rouder, 2011). These tests would each require a major readjustment to the derivation of the JZS Bayes factor. Instead, I propose that we approach problems like these using a fundamentally di erent set of tools, which I will now describe. 3. Generalizing the JZS t-test In this section, I describe a method for extending the default JZS t-test of Rouder and colleagues to use a wider class of priors on e ect size. This generalization thus allows the analyst more freedom to specify a prior that may better reflect his or her a priori expectation about what e ect sizes are typically encountered in a given field. The original description of this method is due to Wetzels et al. (2009) and Wagenmakers et al. (2010), though the software implementation and specific extensions I will describe are novel. The core method relies on a result known as the Savage-Dickey density ratio (Dickey and Lientz, 1970), which states that the Bayes factor for a pair of models in which one of the models is a one- point restriction of the other (i.e., H :  = 0 versus H :  , 0) is simply the ratio of the ordinates 0 1 of the one point of interest (i.e.,  = 0) in the posterior and prior densities, respectively. The technical formulation and proof will be presented momentarily. For now, however, one should appreciate that this can be a great simplification over other methods of computing Bayes factors presented above, since there is no need for integration. Assuming one can estimate the prior and posterior densities via some sampling method (e.g., Markov chain Monte Carlo, or MCMC, sampling), then the computation of this ratio of densities is straightforward. The Savage-Dickey density ratio can stated rigorously as the following proposition: Proposition 1. (Savage-Dickey Density Ratio) Consider two competing models on data y containing parameters  and ', namelyH :  =  ; ' andH : ; '. In this context, we say that  is a parameter 0 0 1 of interest, ' is a nuisance parameter (i.e., common to all models), andH is a sharp point hypothesis nested within H . Suppose further that the prior for the nuisance parameter ' in H is equal to the 1 0 6 prior for ' in H after conditioning on the restriction – that is, p(' j H ) = p(' j  =  ;H ). Then 1 0 0 1 p( =  j y;H ) 0 1 B = : p( =  j H ) 0 1 Proof: By definition, the Bayes factor is the ratio of marginal likelihoods over H and H , respec- 0 1 tively. That is, p(y j H ) B = : (3.1) p(y j H ) The key idea in the proof is that we can use a “change of variables” technique to express B entirely in terms ofH . This proceeds by first unpacking the marginal likelihood forH over the nuisance pa- 1 0 rameter ' and then using the fact thatH is a sharp hypothesis nested withinH to rewrite everything 0 1 in terms of H . Specifically, p(y j H ) = p(y j ';H ) p(' j H )d' 0 0 0 = p(y j ';  =  ;H ) p(' j  =  ;H )d' 0 1 0 1 = p(y j  =  ;H ): 0 1 By Bayes’ Theorem, we can rewrite this last line as p( =  j y;H ) p(y j H ) 0 1 1 p(y j  =  ;H ) = : 0 1 p( =  j H ) 0 1 Thus we have p(y j H ) 1 B = = p(y j H ) 01 0 p(y j H ) p(y j H ) 1 1 = p(y j  =  ;H ) 0 1 p(y j H ) p( =  j y;H ) p(y j H ) 1 0 1 1 p( =  j H ) p(y j H ) 0 1 1 p( =  j y;H ) 0 1 = : p( =  j H ) 0 1 The beauty of Proposition 1 is that it allows one to calculate the Bayes factor for a point null hypothesis (i.e, H :  = 0) by simply computing two densities: (1) the density of  = 0 in the posterior, and (2) the density of  = 0 in the prior. Then, the Bayes factor results by taking the ratio of these posterior and prior densities, respectively. Given this result, this changes the problem of computing Bayes factors from one of integration (e.g, the JZS Bayes factor) to one of estimating prior and posterior densities. 7 3.1. Computing the Savage-Dickey density ratio The Savage-Dickey density ratio is an elegant solution to the problem of computing Bayes factors in situations involving a point null hypothesis H . All that is required is that one can compute samples from the posterior of an e ect size parameter under a specified alternative model H . I think casting the problem in the context is preferable, not only for its flexibility, but especially given the broad class of computer methods now available for sampling posteriors in Bayesian models, including BUGS (Lunn et al., 2000), JAGS (Plummer, 2003), and Stan (Carpenter et al., 2017). I will now focus on one recent addition to this collection – Greta (Golding, 2018). Greta is an R package designed for sampling from Bayesian models. It provides a reasonably simple language for modeling that is implemented directly within R, eliminating the need for writ- ing models in another language (e.g., JAGS, Stan) and then having to call these external files from within R. Further, Greta uses the computational power of Google TensorFlow (Abadi et al., 2015), so it provides fast convergence based on Hamiltonian Monte Carlo sampling (Neal, 2011), it scales well to very large datasets, and it can even be configured to run on GPUs, providing the ability for massive parallel computation. Moreover, it is a free download from the Comprehensive R Archive 1; Network (CRAN) , and as such can be installed directly from within R by typing the command install.packages(``greta'') at the R console. Note that fitting models with Greta will require the user to have a working installation of Python packages for TensorFlow (version 1.10.0 or higher) and tensorflow-probability (version 0.3.0 or higher). Once Greta is installed, the startup message will provide the user with system-specific instructions on how to install these two packages. While this step can be tricky, most errors can be addressed by following the recommendations on the Greta help 2; 3; page and the TensorFlow help page . To illustrate how Greta works, we will first look at a model inspired by that which was initially described by Wetzels et al. (2009) as an alternative to the JZS t-test. As is common in these types of models, the model is depicted as a graphical model (Gilks et al., 1994) in Figure 1. In such graphical models, we use the various nodes to represent all variables of interest. Dependencies between these variables are indicated with graph structure. Deterministic nodes are denoted as rhombuses (i.e., rotated squares), whereas stochastic nodes are represented by unshaded circles. Finally, we denote observed variables by shaded nodes. In this model, our data y is assumed to be drawn from a normal distribution with mean  and variance  . As in the discussion of the JZS t-test above, we consider e ect size  = = as our main parameter of interest. For a fully Bayesian specification, we must place priors on  and . For this first example, we follow Rouder et al. (2009) and Wetzels et al. (2009) and adopt their recommendations of placing a half-Cauchy prior on  (that is, one of the symmetric halves of the Cauchy(0,1) distribution that is defined for positive numbers only). Critically, we assume that e ect size  is distributed as Cauchy(0,1); the scale value of 1 indicates that, a priori, we believe that 50% of our e ect sizes would lie between -1 and 1. Of course, as we’ll see below, if this doesn’t reflect the analyst’s prior belief about , this prior can be easily changed. This choice of model allows us to define two competing hypotheses: H :  = 0 H :   Cauchy(0; 1) https://CRAN.R-project.org/package=greta https://greta-stats.org/articles/get started.html https://tensorflow.rstudio.com/tensorflow/articles/installation.html 8 δ ∼ Cauchy(0, 1) μ = δσ σ ∼ Cauchy(0, 1) y ∼ Normal(μ, σ ) Figure 1: A graphical model for a posterior sampling Bayesian t-test. Table 1: Example data for a single-sample t test Patient 1 2 3 4 5 6 7 8 9 10 Hours gained 0.7 -1.1 -0.2 1.2 0.1 3.4 3.7 0.8 1.8 2.0 AsH is a sharp hypothesis nested withinH , we can apply Proposition 1 (the Savage-Dickey density 0 1 ratio) and compute p( = 0 j y;H ) B = : p( = 0 j H ) To do this, we’ll need to draw samples from the posterior distribution of  under H and estimate the height of  = 0 in an estimated density function for this posterior. All of this can be done in R, as I will now illustrate. To begin, let us consider a simple example, the type of which can be found in most elementary statistics textbooks. This example comes from Hoel (1984). Suppose that 10 patients take part in an experiment on a new drug that is supposed to increase sleep in the patients. Table 1 shows the hours of sleep gained by each patient (negative values indicate lost sleep). Assuming that the sample data are normally distributed with mean  and variance  , we can test the hypothesis H :  = 0 against H :  , 0. The R code necessary to perform a Bayesian t-test on this data is displayed in Listing 1. The first step will be to load the Greta library (see line 2). After this, we need to assign our sample data to a vector y and then convert these to z-scores (see lines 5-6). The next step is to define the prior distributions on  and . The Greta syntax allows this to be done in a quite straightforward manner (see lines 9-10). Further, any deterministic operations should then be defined, as we do in line 13. Then, we can define our likelihood for the z-scores. The wording of the syntax has a nice advantage here, as it describes exactly what we are assuming about our scores; namely, that they are normally distributed with mean  and variance  (see line 16). The last step in setting up the model is to define the model; that is, we collect all of the variables of interest in our analysis. We have three: , , and , which we collect together in line 19. Now we are finally ready to sample from the posterior distributions of the variables in our model. We will focus our interest on delta, but Greta will automatically sample all posteriors for us. This step, displayed in line 22, will take a little while, depending on computing resources. Once the sampling is complete, there are two ways to inspect the samples before proceeding to our inference. The first is to type summary(draws); this will show us various descriptive statistics 9 Listing 1: Building and sampling from the single-sample t-test model 1 # load libraries 2 library(greta) 4 # data from Hoel (1984) 5 y = c(0.7,-1.1,-0.2,1.2,0.1,3.4,3.7,0.8,1.8,2.0) 6 z = y/sd(y) 8 # priors 9 delta = cauchy(0,1) 10 sigma = cauchy(0,1,truncation = c(0,Inf)) 12 # operations 13 mu = delta*sigma 15 # likelihood 16 distribution(z) = normal(mu,sigma) 18 # define model 19 m = model(mu, sigma , delta) 21 # draw samples 22 draws = mcmc(m, n_samples=5000) of the samples, including mean, standard deviation, standard error, and quantiles. In our example, there will be three lines of output; one for each of , , and . Another way to look at the samples is to inspect the path of the samples over time as they explore the posterior distributions. This is done 4; by using the mcmc_trace command from the bayesplot package (Gabry and Mahr, 2018). If the samples converged appropriately, one should see the characteristic “hairy caterpillar” plot, indicating that the chains mixed well and truly randomly explored the posteriors. We will now look at how to compute Bayes factors necessary to compare the models H and H . 0 1 We will do this by computing the Savage-Dickey density ratio. Recall from Proposition 1 that in order to compute B , we simply need to compute the ordinate of  = 0 in the densities of the prior and posterior, and then take their ratio. We already know the density function for the prior, and it is implemented in R as the dcauchy function. However, since we are using samples to approximate the posterior, we need a way to estimate its density function from the samples. One such method is to use a logspline density estimator (Kooperberg and Stone, 1992; Stone et al., 1997), which is implemented 5; by the function logspline from the R package polspline (Kooperberg, 2018). Listing 2 shows the R code necessary to both (1) plot the ordinates from the prior and posterior densities for  = 0 under H , and (2) compute BF as the ratio of these ordinates. The first step is 1 01 to extract the relevant samples from the posterior for  from the object draws (see line 4). Note that if the analyst is interested in other parameters (e.g.,  or , the [,3] part of line 4 can be adjusted appropriately. We can then perform the logspline estimate of the posterior density for , as shown on line 5. https://CRAN.R-project.org/package=bayesplot https://CRAN.R-project.org/package=polspline 10 Table 2: Example data for a two-sample t-test. Raw 62 60 56 63 56 63 59 56 44 61 Roasted 57 56 49 61 55 61 57 54 62 58 From here, there are two paths worth exploring. First, I typically will produce a plot showing the components of the Savage-Dickey density ratio. Such a plot will consist of (1) a plot of the prior density for  under H (the Cauchy(0,1) distribution); (2) a plot of the posterior density (from the logspline estimate); and (3) the ordinates of  = 0 in both of these densities. Lines 7-21 will produce such a graph, which can be seen in Figure 2. Next, we can compute the Savage-Dickey density ratio using the code on lines 24-30. Lines 24-25 compute the specific ordinates required. posterior represents the ordinate of  = 0 in the posterior distribution. Since the posterior was estimated from the logspline function, we call this estimate for our calculation using dlogspline along with the object name of our estimate from line 5 (i.e, fitPost). prior represents the ordinate of  = 0 under H ; computing this uses a simple call to the dcauchy function. Finally, we can divide posterior by prior to compute B , which is denoted in line 26 by BF01. Lines 29 and 30 then simply display both B (the Bayes factor in favor of H 01 0 over H ) and B (the Bayes factor in favor of H over H ). From this computation, one can see that 1 10 1 0 we have moderate support for H , as B  0:4. The intuition for this can be had by looking at how 1 01 density at  = 0 changes from prior to posterior. In Figure 2, the posterior density of  = 0 decreases relative to the prior density, indicating that our belief in a null e ect decreases after observing data x. Equivalently, we can use the reciprocal to compute B  2:6, indicating that our data are 2.6 times more likely under the alternative model H compared to the null model H , giving us positive 1 0 evidence in favor of a nonzero e ect . Now, suppose the analyst had a di erent a prior belief about the e ect sizes he or she would expect in this context. For illustration, let us suppose that  is normally distributed with mean  = 0 and variance  = 0:3, as recommended by Killeen (2007). In this case, all of the above code could be run again with some minor changes. First, one would need to change line 9 in Listing 1 to delta = normal(0,sqrt(0.3)). Also, any line in Listing 2 that uses dcauchy(x,0,1) would need to be replaced by dnorm(x,0,sqrt(0.3)). After this minor change, the resulting Bayes factor would be B  0:25, or equivalently, B  4:0. Note that this Bayes factor is a bit larger than the 01 10 previous one in which the Cauchy prior was used for . The reason folllows from the fact that the Cauchy prior is more dispersed relative to a normal prior, and thus with a normal prior, we have a relatively greater prior mass on smaller e ects. Particularly, the ordinate of  = 0 is larger in the normal prior compared to the Cauchy prior. The result is that the ratio between the ordinates of  = 0 in the prior and posterior becomes larger for the normal prior, thus giving us a larger Bayes factor. 3.2. Using the Savage-Dickey density ratio for a two-sample design The methods described above will readily scale up to problems involving two independent samples. All that is required is that the underlying model is adjusted accordingly. I will illustrate this with another example from Hoel (1984). Consider a sample of 20 rats, each of which receives their main source of protein from either raw peanuts or roasted peanuts. To compare weight gains as a function of protein source, a researcher randomly assigns 10 rats to receive only raw peanuts and 10 rats to receive only roasted peanuts. The resulting weight gains (in grams) are displayed in Table 2. First, we must consider the underlying model. As with the single-sample example, we can repre- sent this model as a directed acyclic graph, which is shown in Figure 3. In this model (inspired by 11 Listing 2: Plotting and computing the Savage-Dickey density ratio 1 Library(polspline) 3 # extract draws from MCMC object and fit a density estimate 4 posteriorDelta = draws[[1]][,3] 5 fitPost = logspline(posteriorDelta) 7 x = seq(-2,2,length.out=1000) 9 # plot density of prior and posterior together 10 plot(x, dlogspline(x, fitPost), type="l", main="", xlab="delta", 11 ylab="density", xlim=c(-2,2), lwd=2, lty=1) 13 # add prior 14 lines(x, dcauchy(x,0,1), lwd=2, lty=3) 16 # add points at 0 for both prior and posterior 17 points(0, dlogspline(0, fitPost), pch=19) 18 points(0, dcauchy(0,0,1), pch=19) 20 legend(-2,0.8, legend=c("Prior density","Posterior density"), 21 lty=c(3,1), lwd=c(2,2), bty="n") 23 # compute SD density ratio 24 posterior <- dlogspline(0, fitPost) 25 prior <- dcauchy(0,0,1) 26 BF01 <- posterior/prior 28 # display both Bayes factors 29 BF01 30 1/BF01 Wetzels et al., 2009), both independent samples x and y are assumed to be drawn from two normal distributions with shared variance  . The mean of the parent distribution of x is  + =2 and the mean for the parent distribution of y is  =2. With this parameterization, represents the “e ect” or di erence between the two populations. As with the single-sample example, we then scale this ef- fect to a standardized e ect  = =. Also, standard Cauchy priors are placed on , , and a truncated Cauchy prior is placed on . For concreteness, let us denote the sample of weight gains from the raw peanut diet as x and the weight gains from roasted peanuts as y. Given the model in Figure 3, our goal is to sample from the posterior distribution of . The R code necessary to perform this sampling is displayed in Listing 5. The procedure is similar to the single-sample model in Listing 1, but there are some notable modifications that are particular to the independent samples model. First, we need to rescale the raw data vectors x and y to z-scores. Since we are assuming shared variance, it suces to base both z- score transformations on only one of x and y. In lines 9-10, I have chosen to base the z-scores on x, but note that similar results will be obtained if instead the researcher chooses to base all z-scores on y. Lines 13-15 define the priors that we assigned to the parameters in our model. Lines 18-24 reflect our 12 Prior density Posterior density −2 −1 0 1 2 delta Figure 2: A plot showing the necessary components for computing the Savage-Dickey density ratio. Included are the density of the prior for  under H (i.e., a Cauchy(0,1) distribution, depicted as a dashed line) as well as the logspline density estimate for the posterior of  (depicted as a solid line). The Bayes factor B can be computed as the ratio of the ordinates of  = 0 under the posterior and prior, respectively. δ ∼ Cauchy(0, 1) μ ∼ Cauchy(0, 1) α σ σ ∼ Cauchy(0, 1) α = δσ x ∼ Normal(μ + α/2, σ ) x 2 y ∼ Normal(μ − α/2, σ ) Figure 3: A graphical model for the posterior sampling independent samples t-test. assumption that data x and y are randomly drawn from two normal distributions centered at  + =2 and  =2, respectively. In lines 27 and 30 we tell Greta to pull 5000 samples from the posterior distribution of ; note that for simplicity, I have only included delta in the model, though one could add any other variable in the model if desired. These posterior samples are then extracted into the density 0.0 0.2 0.4 0.6 0.8 1.0 1.2 13 Listing 3: Building and sampling from the independent-samples t-test model 1 # load libraries 2 library(greta) 4 # data from Hoel (1984) 5 x = c(62, 60, 56, 63, 56, 63, 59, 56, 44, 61) 6 y = c(57, 56, 49, 61, 55, 61, 57, 54, 62, 58) 8 # rescale so that x has mean=0 and sd=1 9 zx = (x-mean(x))/sd(x) 10 zy = (y-mean(x))/sd(x) 12 # priors 13 delta = cauchy(0,1) 14 mu = cauchy(0,1) 15 sigma = cauchy(0,1,truncation = c(0,Inf)) 17 # operations 18 alpha = delta*sigma 19 mux = mu + alpha/2 20 muy = mu - alpha/2 22 # likelihood 23 distribution(zx) = normal(mux,sigma) 24 distribution(zy) = normal(muy,sigma) 26 # define model 27 m = model(delta) 29 # MCMC sample 30 draws = mcmc(m, n_samples = 5000) 32 # extract draws from MCMC object 33 posteriorDelta = draws[[1]][,1] vector posteriorDelta in line 33. After completing the code in Listing 5, the Savage-Dickey density ratio can be plotted and com- puted as we did earlier in Listing 2. Figure 4 shows this ratio graphically; indeed, note that the poste- rior density of  = 0 increases relative to the prior density. This ratio is computed to be B = 2:92, indicating that the data are 2.92 times more likely under H than under H . Thus, we can conclude 0 1 positive support for a null e ect of peanut type on rats’ weight gain. 4. Extension: using encompassing priors for inequality constraints In the previous section, I described an extension of the JZS t-test that uses MCMC sampling to ap- proximate the posterior distribution of e ect size . This method works for sharp hypotheses (i.e., a point null, such as H :  = 0) by employing the Savage-Dickey density ratio, which reduces the calculation of the Bayes factor B into a simple ratio based on the ordinates of the point  = 0 in both 01 14 Prior density Posterior density −2 −1 0 1 2 delta Figure 4: A plot showing the necessary components for computing the Savage-Dickey density ratio in the independent-samples model. Included are the density of the prior for  under H (i.e., a Cauchy(0,1) distri- bution, depicted as a dashed line) as well as the logspline density estimate for the posterior of  (depicted as a solid line). The Bayes factor B can be computed as the ratio of the ordinates of  = 0 under the posterior and prior, respectively. the prior and posterior distributions for . Consider again the sleep example above. What if instead the researcher wanted to whether the new drug increased sleep in patients? This would require the ability to test a directional hypotheses H :  > 0 against H :   0. At first glance, this seems like quite a di erent problem, as the 1 0 Savage-Dickey density ratio does not directly apply to models with inequality constraints. However, there is a method due originally to Klugkist et al. (2005) that fits with this type of problem. In their approach, Klugkist et al. cast such problems as one of testing models with inequality constraints nested within an encompassing model. In this context, both hypotheses H and H are considered as 0 1 specific inequality constraints nested within an encompassing modelH : , where  is unconstrained (i.e.,  2 R). The Klugkist et al. approach (which I will hereafter call the encompassing approach) amounts to using MCMC samples to calculate p(y j H ) B = 0e p(y j H ) and p(y j H ) B = : 1e p(y j H ) density 0.0 0.2 0.4 0.6 0.8 1.0 15 Once these two Bayes factors are computed, one can use transitivity of Bayes factors to compute B = B  B = B  : 01 0e e1 0e 1e The mechanics of the encompassing approach can be summarized in the following proposition: Proposition 2. Consider two models H and H , where H is nested within an encompassing model 1 e 1 H via an inequality constraint on some parameter , and  is unconstrained under H . Then e e c 1=d B = = 1e d 1=c where 1=d and 1=c represent the proportions of the posterior and prior of the encompassing model, respectively, that are in agreement with the inequality constraint imposed by the nested model H . Proof: Consider first that for any modelH on data y with parameter vector , Bayes’ theorem implies f (y j ;H ) p( j H ) t t p( j y;H ) = : p(y j H ) Thus, we can write the marginal likelihood for y under H as f (y j ;H ) p( j H ) t t p(y j H ) = : p( j y;H ) Taking the ratio of the marginal likelihoods for H and the encompassing model H yields the fol- 1 e lowing Bayes factor: f (y j ;H ) p( j H )= p( j y;H ) 1 1 1 B = : 1e f (y j ;H ) p( j H )= p( j y;H ) e e e Now, both the constrained model H and the encompassing model H contain the same parameters 1 e . Choose a specific value of , say  , that exists in both models H and H (we can do this because 1 e 0 0 0 H is nested within H . Then, for this parameter value  , we have f (y j  ;H ) = f (y j  ;H ), so 1 e 1 2 the expression for the Bayes factor reduces to an expression involving only the priors and posteriors for  under H and H : 1 e 0 0 p( j H )= p( j y;H ) 1 1 B = : 1e 0 0 p( j H )= p( j y;H ) e e BecauseH is nested withinH via an inequality constraint, the prior p( j H ) is simply a truncation 1 e 1 0 0 of the encompassing prior p( j H ). Thus, we can express p( j H ) in terms of the encompassing e 1 prior p( j H ) by multiplying the encompassing prior by an indicator function over H and then e 1 normalizing the resulting product. That is, p( j H ) I e  2H p( j H ) = 0 0 p( j H ) I 0 d e  2H 2H = R  p( j H ); 0 0 p( j H ) I 0 d e  2H 1 16 where I 0 is an indicator function. For parameters  2 H , this indicator function is identically 2H 1 equal to 1, so the expression in parentheses reduces to a constant, say c, allowing us to write 0 0 p( j H ) = c p( j H ): 1 e By similar reasoning, we can write the posterior as I 0 2H 0 0 0 p( j y;H ) =  p( j y;H ) = d  p( j y;H ): 1 e e 0 0 p( j y;H )I 0 d e  2H This gives us 0 0 c p( j H )=d  p( j y;H ) c 1=d e e B = = = : 1e 0 0 p( j H )= p( j y;H ) d 1=c e e Note that by definition, 1=d represents the proportion of the posterior distribution for  under the encompassing model H that agrees with the constraints imposed by H . Similarly, 1=c represents e 1 the proportion of the prior distribution for  under the encompassing model H that agrees with the constraints imposed by H . It might seem a bit odd to represent the fraction c=d in the form (1=d)=(1=c). However, this is again done for a computational advantagem, as we can use MCMC sampling to easily estimate the proportions 1=d and 1=c. Also note that in some sense, the encompassing prior approach of Klugkist et al. (2005) is a generalized version of the Savage-Dickey density ratio. Indeed, Wetzels et al. (2010) proved that under “about equality” constraints (e.g., a constrained model M : " <  < " for " > 0), the Bayes factor derived from the encompassing approach tends toward the Bayes factor (for the point null where  = 0) obtained from the Savage-Dickey density ratio as " ! 0. 4.1. Computing Bayes factors with the encompassing approach To illustrate the computation of Bayes factors with the encompassing approach, let us consider the problem mentioned immediately above – suppose we wanted to test whether the drug that we ad- ministered to sleep patients actually increased the patients’ sleep. Specifically, we wish to compare H :   0 against H :  > 0. We will do this by considering both H and H as models with 0 1 0 1 inequality constraints nested with an encompassing model H : , where  is unconstrained. Then, we can use transitivity to compute B = B  B = B =B . 10 1e e0 1e 0e The R code necessary to perform these computations is in Listing 4. As the encompassing model is defined identical to that from Figure 1, the code assumes that we have already drawn samples from that model, as we did in Listing 1. Note that just like our previous computations with the Savage- Dickey density ratio, using the encompassing approach requires that we sample from the posterior of under the unconstrained, encompassing model H . Though the notation is di erent, this is exactly the same posterior distribution that we sampled from in Listing 1. The key steps in Listing 4 are as follows. First, we will compare H to the encompassing model H . To this end, we need to compute the proportion of posterior samples from the encompassing model that are in agreement with the inequality constraint imposed by H (this is the quantity 1=d in the proof of Proposition 2). We say that such samples are “evidential” of H . The R code that will compute this proportion is in line 7. Then, we need to compute the proportion of evidential samples in the prior (i.e., 1=c). Since the prior has known density   Cauchy(0; 1), we can use the pcauchy command to directly compute the proportion of values  that are less than 0; this computation proceeds 17 Listing 4: Computing a Bayes factor for a directional hypothesis using the encompassing approach 1 # directional hypothesis using encompassing priors 2 # H_0: "null" model: delta <= 0 3 # H_1: "alternative" model: delta > 0 4 # H_e: "encompassing" model: delta ~ Cauchy (unconstrained) 6 # H_0 versus H_e 7 postEvidential = mean(posteriorDelta <=0) 8 priorEvidential = pcauchy(0,0,1) 9 BF0e = postEvidential/priorEvidential 11 # H_1 versus H_e 12 postEvidential = mean(posteriorDelta >0) 13 priorEvidential = 1-pcauchy(0,0,1) 14 BF1e = postEvidential/priorEvidential 16 # H_1 versus H_0 17 BF10 = BF1e/BF0e; BF10 in line 8. Then, by Proposition 2, we can simply divide these two quantities to compute B (see line 0e 9). Next, we do a similar computation with H versus the encompassing model H , shown in lines 1 e 12-14. This gives us a value for B . Now, we can compute the Bayes factor for H over H by 1e 1 0 computing B =B  65, indicating that the observed data are approximately 65 times more likely 1e 0e under the alternative model H :  > 0 compared to the null model H :   0. 1 0 This approach can be extended to test a wide variety of hypotheses involving inequality con- straints. One particular advantage of the encompassing approach is that it gives us the ability to test interval null hypotheses – that is, hypotheses of the form H : " <  < ". To illustrate, consider the analyst who is not interested in whether an e ect is exactly 0, but rather, is interested in whether an e ect is larger than threshold, say " = 0:2. An example of such computation is displayed in Listing 5. Like in the example above, we define three hypotheses: two competing hypotheses H : jj < " and H : jj > ", both nested within an 0 1 encompassing modelH : . In the example, I have set " = 0:20, but one can set this value at whatever value seems reasonable for the given context. As in the example before, we use the posterior samples for  under H that were generated in Listing 1 to calculate the proportions of the posterior that satisfied the inequality constraints on imposed byH andH . These computations are performed in lines 9 and 14. The relevant proportions 0 1 of the unconstrained prior that obey the imposed inequality constraints are again calculated using the pcauchy command, as seen in lines 10 and 15. Finally, lines 11,16, and 19 calculate the relevant Bayes factors. As we can see, the Bayes factor B  2:2, indicating that the observed data are 2.2 times more likely under the model H : jj > " compared to the model H : jj < ". Notice that this 1 0 is similar to, but less than, the Bayes factor obtained with the point null hypothesis H :  = 0 from earlier in the paper. 18 Listing 5: Computing a Bayes factor for an interval null hypothesis using the encompassing approach 1 # interval null via encompassing priors 2 # H_0: "null" model: |delta| < epsilon 3 # H_1: |delta| > epsilon 4 # H_e: delta ~ Cauchy (unconstrained) 6 epsilon = 0.2 8 # H_0 versus H_e 9 postEvidential = mean(abs(posteriorDelta) < epsilon) 10 priorEvidential = pcauchy(epsilon ,0,1)-pcauchy(-epsilon ,0,1) 11 B_0e = postEvidential/priorEvidential 13 # H_1 versus H_e 14 postEvidential = mean(abs(posteriorDelta) > epsilon) 15 priorEvidential = (1-pcauchy(epsilon ,0,1)) + pcauchy(-epsilon ,0,1) 16 B_1e = postEvidential/priorEvidential 18 # H_1 versus H_0 19 B_10 = B_1e/B_0e; B_10 5. Conclusions In this tutorial, I have demonstrated a flexible approach to extending the default JZS t-test, a Bayesian test that is becoming increasingly popular in the social and behavioral sciences (Rouder et al., 2009). The approach uses two theoretical results, the Savage-Dickey density ratio (Dickey and Lientz, 1970) and the method of encompassing priors (Klugkist et al., 2005) in combination with an easy-to-use probabilistic modeling package for R called Greta (Golding, 2018). Though the examples presented in this paper are quite trivial to implement, they provide the reader with a general workflow that can be extended to solve problems relevant to his or her own work. Inherent in the techniques presented here is flexibility; the user has complete freedom to specify the underlying models and specific model comparisons in any way that he or she wishes. Finally, the Greta modeling language is easy to learn and readily extends to more complex modelsy. Furthermore, by harnessing the power of Google Tensorflow (Abadi et al., 2015), the MCMC sampler is fast, with all models described in the paper converging in less than one minute. In summary, I think this is an advantageous approach to using de- fault Bayesian tests for common hypothesis testing scenarios, especially those common in the social, behavioral, and other applied sciences. Acknowledgement I am grateful to the handling editor and two anonymous referees for their comments on an earlier version of this manuscript. References Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, 19 R., Kaiser, L., Kudlur, M., Levenberg, J., Mane, ´ D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Vieg ´ as, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org. Carpenter, B., Gelman, A., Ho man, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1). Dickey, J. M. and Lientz, B. P. (1970). The weighted likelihood ratio, sharp hypotheses about chances, the order of a markov chain. The Annals of Mathematical Statistics, 41(1):214–226. Faulkenberry, T. J. (2018). Computing Bayes factors to measure evidence from experiments: An extension of the BIC approximation. Biometrical Letters, 55(1):3143. Gabry, J. and Mahr, T. (2018). bayesplot: Plotting for Bayesian Models. R package version 1.6.0. Gelfand, A. E. and Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410):398–409. Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5):587–606. Gilks, W. R., Thomas, A., and Spiegelhalter, D. J. (1994). A language and program for complex bayesian modelling. The Statistician, 43(1):169–177. Golding, N. (2018). greta: Simple and Scalable Statistical Modelling in R. R package version 0.3.0.9001. Hoekstra, R., Morey, R. D., Rouder, J. N., and Wagenmakers, E.-J. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21(5):1157–1164. Hoel, P. G. (1984). Introduction to Mathematical Statistics. John Wiley & Sons, New York, 5th edition. JASP Team (2018). JASP (Version 0.9)[Computer software]. Je reys, H. (1961). The Theory of Probability (3rd ed.). Oxford University Press, Oxford, UK. Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430):773–795. Killeen, P. R. (2007). Replication statistics as a replacement for significance testing: Best practices in scientific decision-making. In Osborne, J. W., editor, Best practices in quantitative methods, pages 103–124. SAGE Publications, Inc., Thousand Oaks, CA. Klugkist, I., Kato, B., and Hoijtink, H. (2005). Bayesian model selection using encompassing priors. Statistica Neerlandica, 59(1):57–69. Kooperberg, C. (2018). polspline: Polynomial Spline Routines. R package version 1.1.13. Kooperberg, C. and Stone, C. J. (1992). Logspline density estimation for censored data. Journal of Computational and Graphical Statistics, 1(4):301–328. 20 Lunn, D. J., Thomas, A., Best, N., and Spiegelhalter, D. (2000). WinBUGS-a bayesian modelling framework: concepts, structure, and extensibility. Statistics and computing, 10(4):325–337. Masson, M. E. J. (2011). A tutorial on a practical Bayesian alternative to null-hypothesis significance testing. Behavior Research Methods, 43(3):679–690. Morey, R. D. and Rouder, J. N. (2011). Bayes factor approaches for testing interval null hypotheses. Psychological Methods, 16(4):406–419. Morey, R. D. and Rouder, J. N. (2018). BayesFactor: Computation of Bayes Factors for Common Designs. R package version 0.9.12-4.2. Neal, R. (2011). Mcmc using hamiltonian dynamics. In Brooks, S., Gelman, A., Jones, G., and Meng, X., editors, Handbook of Markov Chain Monte Carlo, pages 116–162. Chapman and Hall/CRC. Oakes, M. (1986). Statistical Inference: A commentary for the social and behavioural sciences. John Wiley & Sons, Chicester. Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sam- pling. R Core Team (2018). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25:111–163. Richard, F. D., Bond, C. F., and Stokes-Zoota, J. J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7(4):331–363. Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., and Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2):225–237. Stone, C. J., Hansen, M. H., Kooperberg, C., and Truong, Y. K. (1997). Polynomial splines and their tensor products in extended linear modeling: 1994 Wald memorial lecture. The Annals of Statistics, 25(4):1371–1470. Wagenmakers, E., Wetzels, R., Borsboom, D., and van der Maas, H. L. J. (2011). Why psychologists must change the way they analyze their data: The case of psi: Comment on Bem (2011). Journal of Personality and Social Psychology, 100(3):426–432. Wagenmakers, E.-J., Lodewyckx, T., Kuriyal, H., and Grasman, R. (2010). Bayesian hypothesis test- ing for psychologists: A tutorial on the SavageDickey method. Cognitive Psychology, 60(3):158– Wang, M. (2017). Mixtures of g -priors for analysis of variance models with a diverging number of parameters. Bayesian Analysis, 12(2):511532. Wetzels, R., Grasman, R. P., and Wagenmakers, E.-J. (2010). An encompassing prior generalization of the savagedickey density ratio. Computational Statistics & Data Analysis, 54(9):20942102. 21 Wetzels, R., Raaijmakers, J. G. W., Jakab, E., and Wagenmakers, E.-J. (2009). How to quantify support for and against the null hypothesis: A flexible WinBUGS implementation of a default bayesian t test. Psychonomic Bulletin & Review, 16(4):752–760. Zellner, A. and Siow, A. (1980). Posterior odds ratios for selected regression hypotheses. Trabajos de Estadistica Y de Investigacion Operativa, 31(1):585–603. Appendix: In this appendix, I report a simulation study designed to benchmark performance of the posterior sampling generalization of the JZS t-test described in this paper against the version of the JZS test originally proposed by Rouder et al. (2009). For each simulation, I randomly generated 200 single-sample data sets of size N (where N = 20, 50, or 80) under the model y =  +" . For each of these data sets, di erent “e ects” were represented i i by varying the parameter , which was assumed to be drawn randomly from a normal distribution with mean 0 and variance g (where g = 0, 0.05, or 0.2; also see Wang, 2017; Faulkenberry, 2018). Each of the errors " for a given data set was drawn from a normal distribution with mean 0 and variance 1. The resulting combinations of 3 di erent sample sizes (N = 20; 50; 80) and 3 di erent e ect parameters (g = 0; 0:05; 0:2) produced a total of nine simulations. Once a simulated data set was constructed, I computed the Bayes factor B for the null hypothesis H :  = 0 over the alternative hypothesisH :  = 1 using two methods: (1) the JZS Bayes factor of 0 1 Rouder et al. (2009), and (2) the posterior sampling technique. The JZS Bayes factor was computed using the ttestBF function from the BayesFactor package in R, and the posterior sampling Bayes factor was computed using the methods described in this paper in Section 4.1 (i.e., drawing posterior samples using Greta, fitting a logspline estimate of the posterior, and then computing the Savage- Dickey density ratio by comparing the ordinates of  = 0 in the posterior and prior, respectively). Each Bayes factor was computed using a Cauchy prior of scale r = 1. In all, I found the two methods to be quite comparable to each other. To see why, let’s first inspect the distributions of Bayes factors obtained for each of the nine combinations of N and g. Figure A.1 shows these via overlaid density plots of log(B ) for each computation method. As one can readily see in Figure A.1, the density plots have considerable overlap, indicating that both methods produced very similar distributions of Bayes factors. Further evidence for the compatibility of the two techniques comes from Table A.1, shows five- number summaries for the values of log(B ) obtained for each condition. Additionally, I computed model selection consistency, defined as the proportion of simulated data sets for which the JZS Bayes factor and the posterior sampling Bayes factor led to the same model choice. For this computation, model selection was determined by computing log(B ). If log(B ) > 0, thenH was selected; other- 01 01 0 wise,H was selected (see also Faulkenberry, 2018). As is shown in Table A.1, the posterior sampling technique again produced a distribution of Bayes factors that was very similar to those obtained from the JZS Bayes factor, mirroring what is shown in Figure A.1. Critically, both computation methods selected the same model in a large percentage of the simulated data sets, as indicated by the large consistency values in Table A.1. In all, the proposed sampling method for computing Bayes factors described in this tutorial seems to be quite consistent with the established, albeit less flexible, JZS Bayes factor of Rouder et al. (2009). Thus, the researcher can be confident that the posterior sampling methods described in this paper not only a ord a great deal of flexibility, but also benchmark well against other established methods of computation. 22 g=0, N=20 g=0, N=50 g=0, N=80 1.5 1.0 1.0 1.0 0.5 0.5 0.5 0.0 0.0 0.0 −2 −1 0 1 2 −2 −1 0 1 2 −4 −2 0 2 g=0.05, N=20 g=0.05, N=50 g=0.05, N=80 0.4 0.2 0.4 0.3 0.2 0.2 0.1 0.1 0.0 0.0 0.0 −4 −2 0 2 −8 −4 0 −6 −3 0 3 g=0.2, N=20 g=0.2, N=50 g=0.2, N=80 0.15 0.09 0.2 0.10 0.06 0.1 0.05 0.03 0.0 0.00 0.00 −20 −15 −10 −5 0 −25 −20 −15 −10 −5 0 −30 −20 −10 0 log(BF) log(BF) log(BF) Figure A.1: Density plots of log(B ) for each of the nine combinations of ”e ect” g (0,0.05,0.2) and sample size N (20,50,80). The JZS Bayes factor distribution is displayed as a solid line, whereas the distribution of posterior sampling Bayes factors is displayed as a dashed line. Table A.1: Five-number summary of log(B ) and model selection consistency. g N BF type Min Q Median Q Max Consistency 1 3 0 20 JZS -1.77 -1.71 -1.51 -1.12 1.76 sampling -1.95 -1.70 -1.51 -1.08 1.88 0.985 50 JZS -2.20 -2.16 -2.01 -1.56 1.51 sampling -2.47 -2.16 -1.99 -1.53 2.49 0.985 80 JZS -2.43 -2.38 -2.20 -1.69 4.21 sampling -2.55 -2.36 -2.18 -1.61 2.90 1.000 0.05 20 JZS -4.43 0.23 1.11 1.61 1.77 sampling -4.95 0.27 1.12 1.62 2.04 0.995 50 JZS -10.86 0.08 1.48 1.97 2.20 sampling -4.95 0.27 1.12 1.62 2.04 0.975 80 JZS -8.09 -0.79 1.14 2.06 2.43 sampling -6.96 -0.79 1.16 2.08 2.51 0.945 0.2 20 JZS -10.98 -1.76 0.39 1.44 1.77 sampling -19.90 -1.81 0.25 1.42 1.86 0.975 50 JZS -23.85 -4.25 0.05 1.51 2.20 sampling -12.39 -3.19 0.01 1.49 2.36 0.965 80 JZS -37.79 -6.78 -1.13 1.67 2.43 sampling -25.60 -4.77 -1.11 1.85 2.65 1.000 density density density

Journal

StatisticsarXiv (Cornell University)

Published: Dec 7, 2018

There are no references for this article.