Access the full text.
Sign up today, get DeepDyve free for 14 days.
Steve Brooks, A. Gelman, Galin Jones, X. Meng (2012)
Handbook of Markov Chain Monte Carlo: Hardcover: 619 pages Publisher: Chapman and Hall/CRC Press (first edition, May 2011) Language: English ISBN-10: 1420079417CHANCE, 25
A. Raftery (1995)
Bayesian Model Selection in Social ResearchSociological Methodology, 25
Ruud Wetzels, R. Grasman, E. Wagenmakers (2010)
An encompassing prior generalization of the Savage-Dickey density ratioComput. Stat. Data Anal., 54
Ruud Wetzels, J. Raaijmakers, E. Jakab, E. Wagenmakers (2009)
How to quantify support for and against the null hypothesis: A flexible WinBUGS implementation of a default Bayesian t testPsychonomic Bulletin & Review, 16
Jeffrey Rouder, P. Speckman, Dongchu Sun, R. Morey, G. Iverson (2009)
Bayesian t tests for accepting and rejecting the null hypothesisPsychonomic Bulletin & Review, 16
David Lunn, Andrew Thomas, N. Best, D. Spiegelhalter (2000)
WinBUGS - A Bayesian modelling framework: Concepts, structure, and extensibilityStatistics and Computing, 10
M. Oakes (1986)
Statistical Inference: A Commentary for the Social and Behavioural Sciences
N. Perry (1955)
Book Reviews : Introduction to Mathematical Statistics (2nd Ed.), by Paul G. Hoel. New York: John Wiley and Sons, Inc., I954. Pp. xi + 33I. $5.00Educational and Psychological Measurement, 15
Radford Neal (2011)
Handbook of Markov Chain Monte Carlo
Martín Abadi, Ashish Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. Corrado, Andy Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Yangqing Jia, R. Józefowicz, Lukasz Kaiser, M. Kudlur, J. Levenberg, Dandelion Mané, R. Monga, Sherry Moore, D. Murray, C. Olah, M. Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, P. Tucker, Vincent Vanhoucke, Vijay Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Yuan Yu, Xiaoqiang Zheng (2016)
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed SystemsArXiv, abs/1603.04467
A. Gelfand, Adrian Smith (1990)
Sampling-Based Approaches to Calculating Marginal DensitiesJournal of the American Statistical Association, 85
Teri Granger, Kim Harper, Tom Hruby (2005)
Team Guiding Production of Volume 1
P. Hoel (1947)
Introduction to Mathematical Statistics.Economica, 110
Robert Kass, A. Raftery (1995)
Bayes factors
R. Morey, Jeffrey Rouder (2011)
Bayes factor approaches for testing interval null hypotheses.Psychological methods, 16 4
(2018)
polspline: Polynomial Spline Routines
(2018)
bayesplot: Plotting for Bayesian Models
N. Golding (2019)
greta: simple and scalable statistical modelling in RJ. Open Source Softw., 4
J. Dickey, B. Lientz (1970)
The Weighted Likelihood Ratio, Sharp Hypotheses about Chances, the Order of a Markov ChainAnnals of Mathematical Statistics, 41
G. Reinsel, R. Hogg, A. Craig (1980)
Introduction to Mathematical Statistics (4th ed.).Journal of the American Statistical Association, 75
Thomas Faulkenberry (2018)
Computing Bayes factors to measure evidence from experiments: An extension of the BIC approximationBiometrical Letters, 55
Listing 5: Computing a Bayes factor for an interval null hypothesis using the encompassing approach
I. Klugkist, B. Kato, H. Hoijtink (2005)
Bayesian model selection using encompassing priorsStatistica Neerlandica, 59
M. Masson (2011)
A tutorial on a practical Bayesian alternative to null-hypothesis significance testingBehavior Research Methods, 43
W. Gilks, A. Thomas, D. Spiegelhalter (1994)
A Language and Program for Complex Bayesian ModellingThe Statistician, 43
Min Wang (2017)
Mixtures of $g$-Priors for Analysis of Variance Models with a Diverging Number of ParametersBayesian Analysis, 12
R. Morey, Jeffrey Rouder (2018)
Baysefactor: Computation of Bayes Factors for Common Designs
Bob Carpenter, A. Gelman, M. Hoffman, Daniel Lee, Ben Goodrich, M. Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, A. Riddell (2017)
Stan: A Probabilistic Programming Language.Journal of statistical software, 76
F. Richard, C. Bond, J. Stokes-Zoota (2003)
One Hundred Years of Social Psychology Quantitatively DescribedReview of General Psychology, 7
E. Wagenmakers, Ruud Wetzels, D. Borsboom, H.L.J. Maas (2011)
Why Psychologists Must Change the Way They Analyze Their Data: the Case of Psi: Comment on Bem (2011)
(1961)
The Theory of Probability (3rd ed.)
Rink Hoekstra, R. Morey, Jeffrey Rouder, E. Wagenmakers (2013)
Robust misinterpretation of confidence intervalsPsychonomic Bulletin & Review, 21
C. Kooperberg, C. Stone (1992)
Logspline Density Estimation for Censored DataJournal of Computational and Graphical Statistics, 1
(1984)
JASP ( Version 0 . 9 ) [ Computer software ]
R. Team (2014)
R: A language and environment for statistical computing.MSOR connections, 1
null" model: |delta| < epsilon
H_e: delta˜Cauchy ( unconstrained ) = mean(abs( posteriorDelta ) < epsilon )
E. Wagenmakers, T. Lodewyckx, Himanshu Kuriyal, R. Grasman (2010)
Bayesian hypothesis testing for psychologists: A tutorial on the Savage–Dickey methodCognitive Psychology, 60
G. Gigerenzer (2004)
Mindless statistics
John Klein, H. Houwelingen, Joseph Ibrahim, T. Scheike (2016)
Bayesian Model Selection
S. Iacus, Davide Torre (2003)
Fractals and Statistics: An R Package Called Ifs
A. Zellner, A. Siow (1980)
Posterior odds ratios for selected regression hypothesesTrabajos de Estadistica Y de Investigacion Operativa, 31
C. Stone, M. Hansen, C. Kooperberg, Y. Truong (1997)
Polynomial splines and their tensor products in extended linear modeling: 1994 Wald memorial lectureAnnals of Statistics, 25
(2007)
Replication statistics as a replacement for significance testing: best practices in scientific decision-making
With the advent of so-called “default” Bayesian hypothesis tests, scientists in applied ﬁelds have gained ac- cess to a powerful and principled method for testing hypotheses. However, such default tests usually come with a compromise, requiring the analyst to accept a one-size-ﬁts-all approach to hypothesis testing. Further, such tests may not have the ﬂexibility to test problems the scientist really cares about. In this tutorial, I demonstrate a ﬂexible approach to generalizing one speciﬁc default test (the JZS t-test; Rouder et al., 2009) that is becoming increasingly popular in the social and behavioral sciences. The approach uses two results, the Savage-Dickey density ratio (Dickey & Lientz, 1980) and the technique of encompassing priors (Klugkist et al., 2005) in com- bination with MCMC sampling via an easy-to-use probabilistic modeling package for R called Greta. Through a comprehensive mathematical description of the techniques as well as illustrative examples, the reader is presented with a general, ﬂexible workﬂow that can be extended to solve problems relevant to his or her own work. Note: this paper is in press at Communications for Statistical Applications and Methods. Keywords: Bayes factors, Bayesian inference, hypothesis testing, MCMC sampling, JZS t-test, Savage-Dickey density ratio, encompassing priors 1. Introduction The t-test is one of the simplest, yet most enduring, examples of a hypothesis test that the social and behavioral scientist uses in his or her daily work. In the typical framework of null hypothesis signiﬁcance testing (NHST), the t-test works by ﬁrst assuming a null hypothesis, and then calcu- lating a t-score, which indexes the likelihood of obtaining some sample of observed data under the null hypothesis. If this probability is small, the scientist rejects the null in favor of some alternative hypothesis. Consider the following scenario that is often used when assessing the eect of some treatment. Let th x and x denote measurements for the i participant in two dierent conditions (e.g., a pretest and i1 i2 posttest). Consider the dierence d = x x . A typical consideration is whether these dierences i i2 i1 are dierent from 0; answering this question in the armative would then imply that the treatment had some nonzero eect. To answer this question, one can apply the standard one-sample t-test, which works by ﬁrst assuming d Normal(; ); and then deﬁning two competing hypotheses: a null hypothesis H : = 0 and an alternative hypoth- esis H : , 0. We then test H by computing 1 0 t = ; s= n Corresponding author: Assistant Professor, Department of Psychological Sciences, Tarleton State University, Stephenville, Texas, USA. E-mail: faulkenberry@tarleton.edu arXiv:1812.03092v2 [stat.CO] 5 Feb 2019 2 where d is the mean of the dierences d across all participants i = 1; : : : ; n, s is the sample standard deviation of the dierence scores d , and n is the sample size. Under the null H , the distribution of t i 0 is well known as Student’s t distribution, with density function 2 2 f (x) = 1 + ; where represents degrees of freedom, and x 2 (1;1). The cumulative distribution function F(x) = f (u)du can then be used to index the probability of observing data at least as extreme as that which we observed under the null hypothesis H . Speciﬁcally, we compute p(jxj > t) = 1 F(t), a quantity commonly known as a p-value. If this probability is small (say, less than 5%), then one may decide to reject H and conclude that , 0, thus implying that our treatment had some nonzero eect. This idea is well known to practicing social and behavioral scientists. However, there are some properties of this procedure that are suboptimal for robust inference. For one, the procedure is asym- metric (Rouder et al., 2009). Suppose that one calculates a p-value above some commonly-used threshold like 5%. What decision does the researcher make? Surely the logical opposite of “reject H ” is “accept H ”. However, this decision rule is inconsistent. The reason for this follows from 0 0 examining the distribution of p-values that result from increasing sample sizes. When the null is false (i.e., , ), the value of t increases as sample sizes increase. Thus, the probability of rejecting H 1 2 0 increases accordingly. However, if the null is true, p-values are uniformly distributed between 0 and 1, regardless of sample size. So, whereas a false null hypothesis can always be rejected if sample size is large enough, a true null hypothesis is always susceptible to being incorrectly rejected. Such incon- sistency leads to asymmetry in the testing procedure – increasing sample size can increase evidence against a false null hypothesis, but there is no corresponding way to increase evidence for a true null hypothesis. Another criticism of the traditional hypothesis testing procedure is that researchers often misin- terpret the results of such tests. Hoekstra et al. (2014) asked 562 researchers and students from the ﬁeld of psychology to assess the validity of six dierent statements involving incorrect interpretations of conﬁdence intervals (e.g., “The probability that the true mean is greater than 0 is at least 95%”). Although each of these statements was false, both students and researchers on average believed at least 3 of the statements were true. Furthermore, researchers did no better than students with respect to these misunderstandings. This ﬁnding echos results by Oakes (1986), who performed a similar study using statements about p-values; see also Gigerenzer (2004). In light of these criticisms, the social and behavioral sciences have seen an increase in recommen- dations to ﬁnd alternatives to the orthodox use of null hypothesis signiﬁcance testing (Wagenmakers et al., 2011). One such alternative is to use Bayesian inference, and in particular Bayes factors (Kass and Raftery, 1995; Raftery, 1995; Masson, 2011). Bayesian inference is based on calculating the posterior probability of a hypothesis H after observing data y. This calculation proceeds by Bayes’ theorem, which states p(y j H ) p(H ) p(H j y) = : (1.1) p(y) One way to think of Equation 1.1 is as follows: prior to observing data, one assigns a prior belief p(H ) to a hypothesis H . Once the data y have been observed, one updates this prior belief to a posterior belief p(H j y) by multiplying the prior p(H ) by the likelihood p(y j H ). This product is then rescaled to meet the requirements for being probability distribution (i.e., total probability = 1) by dividing by p(y), the marginal probability of the observed data averaged across all possible hypotheses H . 3 While this computation is fundamentally quite basic, one immediate consequence is how it can be used for comparing two hypotheses. Suppose as above that we have two competing hypotheses: a null hypothesisH and an alternative hypothesisH . We can directly compare our posterior beliefs in 0 1 these two hypotheses by computing their ratio p(H j y)= p(H j y), which we call the posterior odds 0 1 for H over H . Using Bayes’ theorem (Equation 1.1), we can readily see 0 1 p(H j y) p(y j H ) p(H ) 0 0 0 = : (1.2) p(H j y) p(y j H ) p(H ) 1 1 1 | {z } | {z } | {z } posterior odds Bayes factor prior odds As with Bayes’ theorem, Equation 1.2 can also be interpreted in terms of an “updating” metaphor. Speciﬁcally, the posterior odds ratio is equal to the prior odds ratio multiplied by an updating factor. This updating factor is the ratio of the marginal likelihoods p(y j H ) and p(y j H ), and is called 0 1 the Bayes factor (Jereys, 1961; Kass and Raftery, 1995). The Bayes factor is the weight of evidence provided by data y. For example, suppose that one assigned the prior odds of H to H to be equal to 0 1 4-to-1; that is, we believe that, a priori, H is 4 times more likely to be true than H . Then, suppose 0 1 that after observing data, we compute a Bayes factor was computed to be 5. Now, the posterior odds (the odds of H over H after observing data) is 20-to-1 in favor of H over H . 0 1 0 1 There are two immediate advantages to using the Bayes factor for inference. First, the Bayes factor is a ratio, and thus, is subject to a natural interpretation. Simply put, larger is better - the bigger the Bayes factor, the bigger the weight of evidence provided by the observed data. Second, since there was no speciﬁc assumption about the order in which we addressed H and H , we could have just as 0 1 easily measured the weight of evidence in favor of H over H . In fact, once we have a Bayes factor 1 0 in favor of one hypothesis, a simple reciprocal will give us the Bayes factor in favor of the the other hypothesis. In our example above, the Bayes factor for H over H would have been 1/5, implying 1 0 that the data would actually decrease our relative belief in H over H . Because we can compute 1 0 Bayes factors from either direction, we must be careful to deﬁne our notation carefully. In this paper, I will adopt the common convention to deﬁne B as the Bayes factor for H over H . Similarly, B 01 0 1 10 would represent the Bayes factor for H over H . Note that, by our discussion above, B = 1=B . 1 0 01 10 Though the previous discussion certainly speaks positively about the beneﬁts of using the Bayes factor as a tool for inference, there are some important considerations that the researcher must address before implementing it as a tool for inference. First, as we’ll see in the discussion below, the Bayes factor requires the analyst to specify prior distributions on all parameters in the underlying model. Thus, a given Bayes factor reﬂects a speciﬁc choice of prior. Second, the computation of these Bayes factors is usually nontrivial. For example, computing the either the numerator or denominator of Equation 1.2 requires explicitly deﬁning the hypothesisH as a model consisting of vectors in some parameter space and integrating the likelihood weighted by a prior distribution on these parameters; that is, p(y j H ) = f (y j ;H ) p( j H )d; i i i where f is the likelihood function, and p denotes the prior distribution on parameters 2 under model H . However, the last decade has seen the development of many tools intended to simplify the calculation of Bayes factors for the common models used by applied researchers, including online calculators, standalone software packages such as JASP (JASP Team, 2018), and a wide range of packages for the statistical computing environment R (R Core Team, 2018), including the package BayesFactor (Morey and Rouder, 2018). Because these solutions are designed to work across a variety 4 of contexts, one must necessarily assume some defaults with respect to the models that underly these calculators. Many have argued that these defaults represent reasonable assumptions about the types of problems with which many applied researchers are concerned. However, recent advances in statistical computing have made it easier for the practicing researcher to build his or her own custom models for a given situation and compute Bayes factors to compare these models. In this paper, I will provide a tutorial with particular focus on extending one type of Bayesian model comparison known as the Jereys-Zellner-Siow t-test (Rouder et al., 2009) (henceforth abbre- viated as JZS t-test). Speciﬁcally, I will describe a generalization that provides an adaptable, compu- tationally ecient method for computing Bayes factors in a variety of single-sample and independent- samples designs. The organization of the paper is as follows. First, I will describe the mathematical underpinnings of the JZS t-test. Then, I will present two results which allow us to generalize the JZS t-test to a broader class of model comparisons: (1) the Savage-Dickey density ratio (Dickey and Lientz, 1970; Wagenmakers et al., 2010; Wetzels et al., 2009), which is used for comparing models in which one is a sharp hypothesis (e.g, a point null hypothesis) nested within another unconstrained model; and (2) the encompassing prior technique (Klugkist et al., 2005), which is used for comparing nested models with ordinal constraints. Finally, I will demonstrate (with examples) how to use these techniques along with posterior sampling (Gelfand and Smith, 1990) to compute Bayes factors for model comparisons involving both point-null hypotheses (i.e., testing whether an eect is exactly 0), directional hypotheses (i.e., testing whether an eect is postive compared to whether it is negative), and interval-null hypotheses (i.e., testing whether an eect is approximately 0). 2. The JZS t-test The JZS t-test (Rouder et al., 2009) was developed as a default Bayesian version of the orthodox t-test described above. We will denote by y a vector of observed data, and assume as above that y is normally distributed with mean and variance . We can then explicitly deﬁne two competing hypotheses: a null hypothesis H and an alternative hypothesis H . Both hypotheses can be parameterized with 0 1 two parameters, and . Under the alternative hypothesisH , we allow and to freely vary. That is, H is an unconstrained model; could be positive, negative, or zero. For the null hypothesis H , 1 0 which states that the mean of data y is equal to 0, we can simply constrain to be 0. Thus, we say that H is nested within H . Under these models, we can compute the Bayes factor B = p(y j H )= p(y j 0 1 01 0 H ), where 2 2 2 p(y j H ) = f (y j = 0; ;H ) p( ;H )d 0 0 0 and Z Z 1 1 2 2 2 p(y j H ) = f (y j ; ;H ) p(; ;H )d d: 1 1 1 1 0 2 2 These computations require placing priors on under the null model H and both and under the alternative model H . Following Jereys (1961) and Zellner and Siow (1980), Rouder et al. reparameterized the problem by placing a Cauchy prior on eect size = =. That is, under H , Cauchy(0; r), where r represents the scale of expected eect sizes, and underH , = 0. Rouder et 2 2 2 al. placed a Jerey’s prior on ; speciﬁcally, p( ) / 1= . With these default prior speciﬁcations, Rouder et al. (2009) showed that the Bayes factor can be computed as t 2 1 + B = ; (2.1) 1 1 2 1 3 t 1 2 2 2 (1 + Ngr ) 1 + (2) g exp dg 2g 0 (1+Ngr ) 5 where t is the orthodox t statistic, r is the scale on the eect size prior, N is the number of observations in y, and = N 1 denotes the degrees of freedom. Though computationally convenient, this JZS Bayes factor formula has a few disadvantages. First, it reﬂects a very speciﬁc choice of prior speciﬁcation. Though using a Cauchy prior on eect size may be a reasonable choice for many researchers, especially in the behavioral sciences (Rouder et al., 2009), others may argue for a dierent prior. For example, Killeen (2007) used meta-analytic data from Richard et al. (2003) to argued that eect sizes in social psychology are typically normally distributed with variance equal to 0.3. Certainly, one advantage of a Bayesian approach is that the prior on eect size should reﬂect the analyst’s prior belief on what eect sizes should be expected. For some ﬁelds of study, these eect sizes may be expected to be small (i.e., social psychology), whereas in other ﬁelds, these eect sizes may be expected to be larger. Also, note that the reason for using the Cauchy prior (instead of a normal prior) in the JZS Bayes factor is one of computational convenience; it simply makes the computation work out. If the analyst wants to use a dierent prior, the formula for the JZS Bayes factor in Equation 2.1 would have to be recomputed, a task which would only be accessible to those researchers with the appropriate mathematical background. Another disadvantage of the JZS Bayes factor is that it forces the analyst into a very speciﬁc hypothesis test; that is, H : = 0 versus H : , 0. One may be interested instead in more ﬂexible 0 1 testing situations – for example, testing a directional hypothesis H : > 0 versus H : 0, or 0 1 an interval hypothesis such as H : " < < " versus an alternative model H : jj > " (Morey 0 1 and Rouder, 2011). These tests would each require a major readjustment to the derivation of the JZS Bayes factor. Instead, I propose that we approach problems like these using a fundamentally dierent set of tools, which I will now describe. 3. Generalizing the JZS t-test In this section, I describe a method for extending the default JZS t-test of Rouder and colleagues to use a wider class of priors on eect size. This generalization thus allows the analyst more freedom to specify a prior that may better reﬂect his or her a priori expectation about what eect sizes are typically encountered in a given ﬁeld. The original description of this method is due to Wetzels et al. (2009) and Wagenmakers et al. (2010), though the software implementation and speciﬁc extensions I will describe are novel. The core method relies on a result known as the Savage-Dickey density ratio (Dickey and Lientz, 1970), which states that the Bayes factor for a pair of models in which one of the models is a one- point restriction of the other (i.e., H : = 0 versus H : , 0) is simply the ratio of the ordinates 0 1 of the one point of interest (i.e., = 0) in the posterior and prior densities, respectively. The technical formulation and proof will be presented momentarily. For now, however, one should appreciate that this can be a great simpliﬁcation over other methods of computing Bayes factors presented above, since there is no need for integration. Assuming one can estimate the prior and posterior densities via some sampling method (e.g., Markov chain Monte Carlo, or MCMC, sampling), then the computation of this ratio of densities is straightforward. The Savage-Dickey density ratio can stated rigorously as the following proposition: Proposition 1. (Savage-Dickey Density Ratio) Consider two competing models on data y containing parameters and ', namelyH : = ; ' andH : ; '. In this context, we say that is a parameter 0 0 1 of interest, ' is a nuisance parameter (i.e., common to all models), andH is a sharp point hypothesis nested within H . Suppose further that the prior for the nuisance parameter ' in H is equal to the 1 0 6 prior for ' in H after conditioning on the restriction – that is, p(' j H ) = p(' j = ;H ). Then 1 0 0 1 p( = j y;H ) 0 1 B = : p( = j H ) 0 1 Proof: By deﬁnition, the Bayes factor is the ratio of marginal likelihoods over H and H , respec- 0 1 tively. That is, p(y j H ) B = : (3.1) p(y j H ) The key idea in the proof is that we can use a “change of variables” technique to express B entirely in terms ofH . This proceeds by ﬁrst unpacking the marginal likelihood forH over the nuisance pa- 1 0 rameter ' and then using the fact thatH is a sharp hypothesis nested withinH to rewrite everything 0 1 in terms of H . Speciﬁcally, p(y j H ) = p(y j ';H ) p(' j H )d' 0 0 0 = p(y j '; = ;H ) p(' j = ;H )d' 0 1 0 1 = p(y j = ;H ): 0 1 By Bayes’ Theorem, we can rewrite this last line as p( = j y;H ) p(y j H ) 0 1 1 p(y j = ;H ) = : 0 1 p( = j H ) 0 1 Thus we have p(y j H ) 1 B = = p(y j H ) 01 0 p(y j H ) p(y j H ) 1 1 = p(y j = ;H ) 0 1 p(y j H ) p( = j y;H ) p(y j H ) 1 0 1 1 p( = j H ) p(y j H ) 0 1 1 p( = j y;H ) 0 1 = : p( = j H ) 0 1 The beauty of Proposition 1 is that it allows one to calculate the Bayes factor for a point null hypothesis (i.e, H : = 0) by simply computing two densities: (1) the density of = 0 in the posterior, and (2) the density of = 0 in the prior. Then, the Bayes factor results by taking the ratio of these posterior and prior densities, respectively. Given this result, this changes the problem of computing Bayes factors from one of integration (e.g, the JZS Bayes factor) to one of estimating prior and posterior densities. 7 3.1. Computing the Savage-Dickey density ratio The Savage-Dickey density ratio is an elegant solution to the problem of computing Bayes factors in situations involving a point null hypothesis H . All that is required is that one can compute samples from the posterior of an eect size parameter under a speciﬁed alternative model H . I think casting the problem in the context is preferable, not only for its ﬂexibility, but especially given the broad class of computer methods now available for sampling posteriors in Bayesian models, including BUGS (Lunn et al., 2000), JAGS (Plummer, 2003), and Stan (Carpenter et al., 2017). I will now focus on one recent addition to this collection – Greta (Golding, 2018). Greta is an R package designed for sampling from Bayesian models. It provides a reasonably simple language for modeling that is implemented directly within R, eliminating the need for writ- ing models in another language (e.g., JAGS, Stan) and then having to call these external ﬁles from within R. Further, Greta uses the computational power of Google TensorFlow (Abadi et al., 2015), so it provides fast convergence based on Hamiltonian Monte Carlo sampling (Neal, 2011), it scales well to very large datasets, and it can even be conﬁgured to run on GPUs, providing the ability for massive parallel computation. Moreover, it is a free download from the Comprehensive R Archive 1; Network (CRAN) , and as such can be installed directly from within R by typing the command install.packages(``greta'') at the R console. Note that ﬁtting models with Greta will require the user to have a working installation of Python packages for TensorFlow (version 1.10.0 or higher) and tensorﬂow-probability (version 0.3.0 or higher). Once Greta is installed, the startup message will provide the user with system-speciﬁc instructions on how to install these two packages. While this step can be tricky, most errors can be addressed by following the recommendations on the Greta help 2; 3; page and the TensorFlow help page . To illustrate how Greta works, we will ﬁrst look at a model inspired by that which was initially described by Wetzels et al. (2009) as an alternative to the JZS t-test. As is common in these types of models, the model is depicted as a graphical model (Gilks et al., 1994) in Figure 1. In such graphical models, we use the various nodes to represent all variables of interest. Dependencies between these variables are indicated with graph structure. Deterministic nodes are denoted as rhombuses (i.e., rotated squares), whereas stochastic nodes are represented by unshaded circles. Finally, we denote observed variables by shaded nodes. In this model, our data y is assumed to be drawn from a normal distribution with mean and variance . As in the discussion of the JZS t-test above, we consider eect size = = as our main parameter of interest. For a fully Bayesian speciﬁcation, we must place priors on and . For this ﬁrst example, we follow Rouder et al. (2009) and Wetzels et al. (2009) and adopt their recommendations of placing a half-Cauchy prior on (that is, one of the symmetric halves of the Cauchy(0,1) distribution that is deﬁned for positive numbers only). Critically, we assume that eect size is distributed as Cauchy(0,1); the scale value of 1 indicates that, a priori, we believe that 50% of our eect sizes would lie between -1 and 1. Of course, as we’ll see below, if this doesn’t reﬂect the analyst’s prior belief about , this prior can be easily changed. This choice of model allows us to deﬁne two competing hypotheses: H : = 0 H : Cauchy(0; 1) https://CRAN.R-project.org/package=greta https://greta-stats.org/articles/get started.html https://tensorﬂow.rstudio.com/tensorﬂow/articles/installation.html 8 δ ∼ Cauchy(0, 1) μ = δσ σ ∼ Cauchy(0, 1) y ∼ Normal(μ, σ ) Figure 1: A graphical model for a posterior sampling Bayesian t-test. Table 1: Example data for a single-sample t test Patient 1 2 3 4 5 6 7 8 9 10 Hours gained 0.7 -1.1 -0.2 1.2 0.1 3.4 3.7 0.8 1.8 2.0 AsH is a sharp hypothesis nested withinH , we can apply Proposition 1 (the Savage-Dickey density 0 1 ratio) and compute p( = 0 j y;H ) B = : p( = 0 j H ) To do this, we’ll need to draw samples from the posterior distribution of under H and estimate the height of = 0 in an estimated density function for this posterior. All of this can be done in R, as I will now illustrate. To begin, let us consider a simple example, the type of which can be found in most elementary statistics textbooks. This example comes from Hoel (1984). Suppose that 10 patients take part in an experiment on a new drug that is supposed to increase sleep in the patients. Table 1 shows the hours of sleep gained by each patient (negative values indicate lost sleep). Assuming that the sample data are normally distributed with mean and variance , we can test the hypothesis H : = 0 against H : , 0. The R code necessary to perform a Bayesian t-test on this data is displayed in Listing 1. The ﬁrst step will be to load the Greta library (see line 2). After this, we need to assign our sample data to a vector y and then convert these to z-scores (see lines 5-6). The next step is to deﬁne the prior distributions on and . The Greta syntax allows this to be done in a quite straightforward manner (see lines 9-10). Further, any deterministic operations should then be deﬁned, as we do in line 13. Then, we can deﬁne our likelihood for the z-scores. The wording of the syntax has a nice advantage here, as it describes exactly what we are assuming about our scores; namely, that they are normally distributed with mean and variance (see line 16). The last step in setting up the model is to deﬁne the model; that is, we collect all of the variables of interest in our analysis. We have three: , , and , which we collect together in line 19. Now we are ﬁnally ready to sample from the posterior distributions of the variables in our model. We will focus our interest on delta, but Greta will automatically sample all posteriors for us. This step, displayed in line 22, will take a little while, depending on computing resources. Once the sampling is complete, there are two ways to inspect the samples before proceeding to our inference. The ﬁrst is to type summary(draws); this will show us various descriptive statistics 9 Listing 1: Building and sampling from the single-sample t-test model 1 # load libraries 2 library(greta) 4 # data from Hoel (1984) 5 y = c(0.7,-1.1,-0.2,1.2,0.1,3.4,3.7,0.8,1.8,2.0) 6 z = y/sd(y) 8 # priors 9 delta = cauchy(0,1) 10 sigma = cauchy(0,1,truncation = c(0,Inf)) 12 # operations 13 mu = delta*sigma 15 # likelihood 16 distribution(z) = normal(mu,sigma) 18 # define model 19 m = model(mu, sigma , delta) 21 # draw samples 22 draws = mcmc(m, n_samples=5000) of the samples, including mean, standard deviation, standard error, and quantiles. In our example, there will be three lines of output; one for each of , , and . Another way to look at the samples is to inspect the path of the samples over time as they explore the posterior distributions. This is done 4; by using the mcmc_trace command from the bayesplot package (Gabry and Mahr, 2018). If the samples converged appropriately, one should see the characteristic “hairy caterpillar” plot, indicating that the chains mixed well and truly randomly explored the posteriors. We will now look at how to compute Bayes factors necessary to compare the models H and H . 0 1 We will do this by computing the Savage-Dickey density ratio. Recall from Proposition 1 that in order to compute B , we simply need to compute the ordinate of = 0 in the densities of the prior and posterior, and then take their ratio. We already know the density function for the prior, and it is implemented in R as the dcauchy function. However, since we are using samples to approximate the posterior, we need a way to estimate its density function from the samples. One such method is to use a logspline density estimator (Kooperberg and Stone, 1992; Stone et al., 1997), which is implemented 5; by the function logspline from the R package polspline (Kooperberg, 2018). Listing 2 shows the R code necessary to both (1) plot the ordinates from the prior and posterior densities for = 0 under H , and (2) compute BF as the ratio of these ordinates. The ﬁrst step is 1 01 to extract the relevant samples from the posterior for from the object draws (see line 4). Note that if the analyst is interested in other parameters (e.g., or , the [,3] part of line 4 can be adjusted appropriately. We can then perform the logspline estimate of the posterior density for , as shown on line 5. https://CRAN.R-project.org/package=bayesplot https://CRAN.R-project.org/package=polspline 10 Table 2: Example data for a two-sample t-test. Raw 62 60 56 63 56 63 59 56 44 61 Roasted 57 56 49 61 55 61 57 54 62 58 From here, there are two paths worth exploring. First, I typically will produce a plot showing the components of the Savage-Dickey density ratio. Such a plot will consist of (1) a plot of the prior density for under H (the Cauchy(0,1) distribution); (2) a plot of the posterior density (from the logspline estimate); and (3) the ordinates of = 0 in both of these densities. Lines 7-21 will produce such a graph, which can be seen in Figure 2. Next, we can compute the Savage-Dickey density ratio using the code on lines 24-30. Lines 24-25 compute the speciﬁc ordinates required. posterior represents the ordinate of = 0 in the posterior distribution. Since the posterior was estimated from the logspline function, we call this estimate for our calculation using dlogspline along with the object name of our estimate from line 5 (i.e, fitPost). prior represents the ordinate of = 0 under H ; computing this uses a simple call to the dcauchy function. Finally, we can divide posterior by prior to compute B , which is denoted in line 26 by BF01. Lines 29 and 30 then simply display both B (the Bayes factor in favor of H 01 0 over H ) and B (the Bayes factor in favor of H over H ). From this computation, one can see that 1 10 1 0 we have moderate support for H , as B 0:4. The intuition for this can be had by looking at how 1 01 density at = 0 changes from prior to posterior. In Figure 2, the posterior density of = 0 decreases relative to the prior density, indicating that our belief in a null eect decreases after observing data x. Equivalently, we can use the reciprocal to compute B 2:6, indicating that our data are 2.6 times more likely under the alternative model H compared to the null model H , giving us positive 1 0 evidence in favor of a nonzero eect . Now, suppose the analyst had a dierent a prior belief about the eect sizes he or she would expect in this context. For illustration, let us suppose that is normally distributed with mean = 0 and variance = 0:3, as recommended by Killeen (2007). In this case, all of the above code could be run again with some minor changes. First, one would need to change line 9 in Listing 1 to delta = normal(0,sqrt(0.3)). Also, any line in Listing 2 that uses dcauchy(x,0,1) would need to be replaced by dnorm(x,0,sqrt(0.3)). After this minor change, the resulting Bayes factor would be B 0:25, or equivalently, B 4:0. Note that this Bayes factor is a bit larger than the 01 10 previous one in which the Cauchy prior was used for . The reason folllows from the fact that the Cauchy prior is more dispersed relative to a normal prior, and thus with a normal prior, we have a relatively greater prior mass on smaller eects. Particularly, the ordinate of = 0 is larger in the normal prior compared to the Cauchy prior. The result is that the ratio between the ordinates of = 0 in the prior and posterior becomes larger for the normal prior, thus giving us a larger Bayes factor. 3.2. Using the Savage-Dickey density ratio for a two-sample design The methods described above will readily scale up to problems involving two independent samples. All that is required is that the underlying model is adjusted accordingly. I will illustrate this with another example from Hoel (1984). Consider a sample of 20 rats, each of which receives their main source of protein from either raw peanuts or roasted peanuts. To compare weight gains as a function of protein source, a researcher randomly assigns 10 rats to receive only raw peanuts and 10 rats to receive only roasted peanuts. The resulting weight gains (in grams) are displayed in Table 2. First, we must consider the underlying model. As with the single-sample example, we can repre- sent this model as a directed acyclic graph, which is shown in Figure 3. In this model (inspired by 11 Listing 2: Plotting and computing the Savage-Dickey density ratio 1 Library(polspline) 3 # extract draws from MCMC object and fit a density estimate 4 posteriorDelta = draws[[1]][,3] 5 fitPost = logspline(posteriorDelta) 7 x = seq(-2,2,length.out=1000) 9 # plot density of prior and posterior together 10 plot(x, dlogspline(x, fitPost), type="l", main="", xlab="delta", 11 ylab="density", xlim=c(-2,2), lwd=2, lty=1) 13 # add prior 14 lines(x, dcauchy(x,0,1), lwd=2, lty=3) 16 # add points at 0 for both prior and posterior 17 points(0, dlogspline(0, fitPost), pch=19) 18 points(0, dcauchy(0,0,1), pch=19) 20 legend(-2,0.8, legend=c("Prior density","Posterior density"), 21 lty=c(3,1), lwd=c(2,2), bty="n") 23 # compute SD density ratio 24 posterior <- dlogspline(0, fitPost) 25 prior <- dcauchy(0,0,1) 26 BF01 <- posterior/prior 28 # display both Bayes factors 29 BF01 30 1/BF01 Wetzels et al., 2009), both independent samples x and y are assumed to be drawn from two normal distributions with shared variance . The mean of the parent distribution of x is + =2 and the mean for the parent distribution of y is =2. With this parameterization, represents the “eect” or dierence between the two populations. As with the single-sample example, we then scale this ef- fect to a standardized eect = =. Also, standard Cauchy priors are placed on , , and a truncated Cauchy prior is placed on . For concreteness, let us denote the sample of weight gains from the raw peanut diet as x and the weight gains from roasted peanuts as y. Given the model in Figure 3, our goal is to sample from the posterior distribution of . The R code necessary to perform this sampling is displayed in Listing 5. The procedure is similar to the single-sample model in Listing 1, but there are some notable modiﬁcations that are particular to the independent samples model. First, we need to rescale the raw data vectors x and y to z-scores. Since we are assuming shared variance, it suces to base both z- score transformations on only one of x and y. In lines 9-10, I have chosen to base the z-scores on x, but note that similar results will be obtained if instead the researcher chooses to base all z-scores on y. Lines 13-15 deﬁne the priors that we assigned to the parameters in our model. Lines 18-24 reﬂect our 12 Prior density Posterior density −2 −1 0 1 2 delta Figure 2: A plot showing the necessary components for computing the Savage-Dickey density ratio. Included are the density of the prior for under H (i.e., a Cauchy(0,1) distribution, depicted as a dashed line) as well as the logspline density estimate for the posterior of (depicted as a solid line). The Bayes factor B can be computed as the ratio of the ordinates of = 0 under the posterior and prior, respectively. δ ∼ Cauchy(0, 1) μ ∼ Cauchy(0, 1) α σ σ ∼ Cauchy(0, 1) α = δσ x ∼ Normal(μ + α/2, σ ) x 2 y ∼ Normal(μ − α/2, σ ) Figure 3: A graphical model for the posterior sampling independent samples t-test. assumption that data x and y are randomly drawn from two normal distributions centered at + =2 and =2, respectively. In lines 27 and 30 we tell Greta to pull 5000 samples from the posterior distribution of ; note that for simplicity, I have only included delta in the model, though one could add any other variable in the model if desired. These posterior samples are then extracted into the density 0.0 0.2 0.4 0.6 0.8 1.0 1.2 13 Listing 3: Building and sampling from the independent-samples t-test model 1 # load libraries 2 library(greta) 4 # data from Hoel (1984) 5 x = c(62, 60, 56, 63, 56, 63, 59, 56, 44, 61) 6 y = c(57, 56, 49, 61, 55, 61, 57, 54, 62, 58) 8 # rescale so that x has mean=0 and sd=1 9 zx = (x-mean(x))/sd(x) 10 zy = (y-mean(x))/sd(x) 12 # priors 13 delta = cauchy(0,1) 14 mu = cauchy(0,1) 15 sigma = cauchy(0,1,truncation = c(0,Inf)) 17 # operations 18 alpha = delta*sigma 19 mux = mu + alpha/2 20 muy = mu - alpha/2 22 # likelihood 23 distribution(zx) = normal(mux,sigma) 24 distribution(zy) = normal(muy,sigma) 26 # define model 27 m = model(delta) 29 # MCMC sample 30 draws = mcmc(m, n_samples = 5000) 32 # extract draws from MCMC object 33 posteriorDelta = draws[[1]][,1] vector posteriorDelta in line 33. After completing the code in Listing 5, the Savage-Dickey density ratio can be plotted and com- puted as we did earlier in Listing 2. Figure 4 shows this ratio graphically; indeed, note that the poste- rior density of = 0 increases relative to the prior density. This ratio is computed to be B = 2:92, indicating that the data are 2.92 times more likely under H than under H . Thus, we can conclude 0 1 positive support for a null eect of peanut type on rats’ weight gain. 4. Extension: using encompassing priors for inequality constraints In the previous section, I described an extension of the JZS t-test that uses MCMC sampling to ap- proximate the posterior distribution of eect size . This method works for sharp hypotheses (i.e., a point null, such as H : = 0) by employing the Savage-Dickey density ratio, which reduces the calculation of the Bayes factor B into a simple ratio based on the ordinates of the point = 0 in both 01 14 Prior density Posterior density −2 −1 0 1 2 delta Figure 4: A plot showing the necessary components for computing the Savage-Dickey density ratio in the independent-samples model. Included are the density of the prior for under H (i.e., a Cauchy(0,1) distri- bution, depicted as a dashed line) as well as the logspline density estimate for the posterior of (depicted as a solid line). The Bayes factor B can be computed as the ratio of the ordinates of = 0 under the posterior and prior, respectively. the prior and posterior distributions for . Consider again the sleep example above. What if instead the researcher wanted to whether the new drug increased sleep in patients? This would require the ability to test a directional hypotheses H : > 0 against H : 0. At ﬁrst glance, this seems like quite a dierent problem, as the 1 0 Savage-Dickey density ratio does not directly apply to models with inequality constraints. However, there is a method due originally to Klugkist et al. (2005) that ﬁts with this type of problem. In their approach, Klugkist et al. cast such problems as one of testing models with inequality constraints nested within an encompassing model. In this context, both hypotheses H and H are considered as 0 1 speciﬁc inequality constraints nested within an encompassing modelH : , where is unconstrained (i.e., 2 R). The Klugkist et al. approach (which I will hereafter call the encompassing approach) amounts to using MCMC samples to calculate p(y j H ) B = 0e p(y j H ) and p(y j H ) B = : 1e p(y j H ) density 0.0 0.2 0.4 0.6 0.8 1.0 15 Once these two Bayes factors are computed, one can use transitivity of Bayes factors to compute B = B B = B : 01 0e e1 0e 1e The mechanics of the encompassing approach can be summarized in the following proposition: Proposition 2. Consider two models H and H , where H is nested within an encompassing model 1 e 1 H via an inequality constraint on some parameter , and is unconstrained under H . Then e e c 1=d B = = 1e d 1=c where 1=d and 1=c represent the proportions of the posterior and prior of the encompassing model, respectively, that are in agreement with the inequality constraint imposed by the nested model H . Proof: Consider ﬁrst that for any modelH on data y with parameter vector , Bayes’ theorem implies f (y j ;H ) p( j H ) t t p( j y;H ) = : p(y j H ) Thus, we can write the marginal likelihood for y under H as f (y j ;H ) p( j H ) t t p(y j H ) = : p( j y;H ) Taking the ratio of the marginal likelihoods for H and the encompassing model H yields the fol- 1 e lowing Bayes factor: f (y j ;H ) p( j H )= p( j y;H ) 1 1 1 B = : 1e f (y j ;H ) p( j H )= p( j y;H ) e e e Now, both the constrained model H and the encompassing model H contain the same parameters 1 e . Choose a speciﬁc value of , say , that exists in both models H and H (we can do this because 1 e 0 0 0 H is nested within H . Then, for this parameter value , we have f (y j ;H ) = f (y j ;H ), so 1 e 1 2 the expression for the Bayes factor reduces to an expression involving only the priors and posteriors for under H and H : 1 e 0 0 p( j H )= p( j y;H ) 1 1 B = : 1e 0 0 p( j H )= p( j y;H ) e e BecauseH is nested withinH via an inequality constraint, the prior p( j H ) is simply a truncation 1 e 1 0 0 of the encompassing prior p( j H ). Thus, we can express p( j H ) in terms of the encompassing e 1 prior p( j H ) by multiplying the encompassing prior by an indicator function over H and then e 1 normalizing the resulting product. That is, p( j H ) I e 2H p( j H ) = 0 0 p( j H ) I 0 d e 2H 2H = R p( j H ); 0 0 p( j H ) I 0 d e 2H 1 16 where I 0 is an indicator function. For parameters 2 H , this indicator function is identically 2H 1 equal to 1, so the expression in parentheses reduces to a constant, say c, allowing us to write 0 0 p( j H ) = c p( j H ): 1 e By similar reasoning, we can write the posterior as I 0 2H 0 0 0 p( j y;H ) = p( j y;H ) = d p( j y;H ): 1 e e 0 0 p( j y;H )I 0 d e 2H This gives us 0 0 c p( j H )=d p( j y;H ) c 1=d e e B = = = : 1e 0 0 p( j H )= p( j y;H ) d 1=c e e Note that by deﬁnition, 1=d represents the proportion of the posterior distribution for under the encompassing model H that agrees with the constraints imposed by H . Similarly, 1=c represents e 1 the proportion of the prior distribution for under the encompassing model H that agrees with the constraints imposed by H . It might seem a bit odd to represent the fraction c=d in the form (1=d)=(1=c). However, this is again done for a computational advantagem, as we can use MCMC sampling to easily estimate the proportions 1=d and 1=c. Also note that in some sense, the encompassing prior approach of Klugkist et al. (2005) is a generalized version of the Savage-Dickey density ratio. Indeed, Wetzels et al. (2010) proved that under “about equality” constraints (e.g., a constrained model M : " < < " for " > 0), the Bayes factor derived from the encompassing approach tends toward the Bayes factor (for the point null where = 0) obtained from the Savage-Dickey density ratio as " ! 0. 4.1. Computing Bayes factors with the encompassing approach To illustrate the computation of Bayes factors with the encompassing approach, let us consider the problem mentioned immediately above – suppose we wanted to test whether the drug that we ad- ministered to sleep patients actually increased the patients’ sleep. Speciﬁcally, we wish to compare H : 0 against H : > 0. We will do this by considering both H and H as models with 0 1 0 1 inequality constraints nested with an encompassing model H : , where is unconstrained. Then, we can use transitivity to compute B = B B = B =B . 10 1e e0 1e 0e The R code necessary to perform these computations is in Listing 4. As the encompassing model is deﬁned identical to that from Figure 1, the code assumes that we have already drawn samples from that model, as we did in Listing 1. Note that just like our previous computations with the Savage- Dickey density ratio, using the encompassing approach requires that we sample from the posterior of under the unconstrained, encompassing model H . Though the notation is dierent, this is exactly the same posterior distribution that we sampled from in Listing 1. The key steps in Listing 4 are as follows. First, we will compare H to the encompassing model H . To this end, we need to compute the proportion of posterior samples from the encompassing model that are in agreement with the inequality constraint imposed by H (this is the quantity 1=d in the proof of Proposition 2). We say that such samples are “evidential” of H . The R code that will compute this proportion is in line 7. Then, we need to compute the proportion of evidential samples in the prior (i.e., 1=c). Since the prior has known density Cauchy(0; 1), we can use the pcauchy command to directly compute the proportion of values that are less than 0; this computation proceeds 17 Listing 4: Computing a Bayes factor for a directional hypothesis using the encompassing approach 1 # directional hypothesis using encompassing priors 2 # H_0: "null" model: delta <= 0 3 # H_1: "alternative" model: delta > 0 4 # H_e: "encompassing" model: delta ~ Cauchy (unconstrained) 6 # H_0 versus H_e 7 postEvidential = mean(posteriorDelta <=0) 8 priorEvidential = pcauchy(0,0,1) 9 BF0e = postEvidential/priorEvidential 11 # H_1 versus H_e 12 postEvidential = mean(posteriorDelta >0) 13 priorEvidential = 1-pcauchy(0,0,1) 14 BF1e = postEvidential/priorEvidential 16 # H_1 versus H_0 17 BF10 = BF1e/BF0e; BF10 in line 8. Then, by Proposition 2, we can simply divide these two quantities to compute B (see line 0e 9). Next, we do a similar computation with H versus the encompassing model H , shown in lines 1 e 12-14. This gives us a value for B . Now, we can compute the Bayes factor for H over H by 1e 1 0 computing B =B 65, indicating that the observed data are approximately 65 times more likely 1e 0e under the alternative model H : > 0 compared to the null model H : 0. 1 0 This approach can be extended to test a wide variety of hypotheses involving inequality con- straints. One particular advantage of the encompassing approach is that it gives us the ability to test interval null hypotheses – that is, hypotheses of the form H : " < < ". To illustrate, consider the analyst who is not interested in whether an eect is exactly 0, but rather, is interested in whether an eect is larger than threshold, say " = 0:2. An example of such computation is displayed in Listing 5. Like in the example above, we deﬁne three hypotheses: two competing hypotheses H : jj < " and H : jj > ", both nested within an 0 1 encompassing modelH : . In the example, I have set " = 0:20, but one can set this value at whatever value seems reasonable for the given context. As in the example before, we use the posterior samples for under H that were generated in Listing 1 to calculate the proportions of the posterior that satisﬁed the inequality constraints on imposed byH andH . These computations are performed in lines 9 and 14. The relevant proportions 0 1 of the unconstrained prior that obey the imposed inequality constraints are again calculated using the pcauchy command, as seen in lines 10 and 15. Finally, lines 11,16, and 19 calculate the relevant Bayes factors. As we can see, the Bayes factor B 2:2, indicating that the observed data are 2.2 times more likely under the model H : jj > " compared to the model H : jj < ". Notice that this 1 0 is similar to, but less than, the Bayes factor obtained with the point null hypothesis H : = 0 from earlier in the paper. 18 Listing 5: Computing a Bayes factor for an interval null hypothesis using the encompassing approach 1 # interval null via encompassing priors 2 # H_0: "null" model: |delta| < epsilon 3 # H_1: |delta| > epsilon 4 # H_e: delta ~ Cauchy (unconstrained) 6 epsilon = 0.2 8 # H_0 versus H_e 9 postEvidential = mean(abs(posteriorDelta) < epsilon) 10 priorEvidential = pcauchy(epsilon ,0,1)-pcauchy(-epsilon ,0,1) 11 B_0e = postEvidential/priorEvidential 13 # H_1 versus H_e 14 postEvidential = mean(abs(posteriorDelta) > epsilon) 15 priorEvidential = (1-pcauchy(epsilon ,0,1)) + pcauchy(-epsilon ,0,1) 16 B_1e = postEvidential/priorEvidential 18 # H_1 versus H_0 19 B_10 = B_1e/B_0e; B_10 5. Conclusions In this tutorial, I have demonstrated a ﬂexible approach to extending the default JZS t-test, a Bayesian test that is becoming increasingly popular in the social and behavioral sciences (Rouder et al., 2009). The approach uses two theoretical results, the Savage-Dickey density ratio (Dickey and Lientz, 1970) and the method of encompassing priors (Klugkist et al., 2005) in combination with an easy-to-use probabilistic modeling package for R called Greta (Golding, 2018). Though the examples presented in this paper are quite trivial to implement, they provide the reader with a general workﬂow that can be extended to solve problems relevant to his or her own work. Inherent in the techniques presented here is ﬂexibility; the user has complete freedom to specify the underlying models and speciﬁc model comparisons in any way that he or she wishes. Finally, the Greta modeling language is easy to learn and readily extends to more complex modelsy. Furthermore, by harnessing the power of Google Tensorﬂow (Abadi et al., 2015), the MCMC sampler is fast, with all models described in the paper converging in less than one minute. In summary, I think this is an advantageous approach to using de- fault Bayesian tests for common hypothesis testing scenarios, especially those common in the social, behavioral, and other applied sciences. Acknowledgement I am grateful to the handling editor and two anonymous referees for their comments on an earlier version of this manuscript. References Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, 19 R., Kaiser, L., Kudlur, M., Levenberg, J., Mane, ´ D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Vieg ´ as, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorﬂow.org. Carpenter, B., Gelman, A., Homan, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1). Dickey, J. M. and Lientz, B. P. (1970). The weighted likelihood ratio, sharp hypotheses about chances, the order of a markov chain. The Annals of Mathematical Statistics, 41(1):214–226. Faulkenberry, T. J. (2018). Computing Bayes factors to measure evidence from experiments: An extension of the BIC approximation. Biometrical Letters, 55(1):3143. Gabry, J. and Mahr, T. (2018). bayesplot: Plotting for Bayesian Models. R package version 1.6.0. Gelfand, A. E. and Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410):398–409. Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5):587–606. Gilks, W. R., Thomas, A., and Spiegelhalter, D. J. (1994). A language and program for complex bayesian modelling. The Statistician, 43(1):169–177. Golding, N. (2018). greta: Simple and Scalable Statistical Modelling in R. R package version 0.3.0.9001. Hoekstra, R., Morey, R. D., Rouder, J. N., and Wagenmakers, E.-J. (2014). Robust misinterpretation of conﬁdence intervals. Psychonomic Bulletin & Review, 21(5):1157–1164. Hoel, P. G. (1984). Introduction to Mathematical Statistics. John Wiley & Sons, New York, 5th edition. JASP Team (2018). JASP (Version 0.9)[Computer software]. Jereys, H. (1961). The Theory of Probability (3rd ed.). Oxford University Press, Oxford, UK. Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430):773–795. Killeen, P. R. (2007). Replication statistics as a replacement for signiﬁcance testing: Best practices in scientiﬁc decision-making. In Osborne, J. W., editor, Best practices in quantitative methods, pages 103–124. SAGE Publications, Inc., Thousand Oaks, CA. Klugkist, I., Kato, B., and Hoijtink, H. (2005). Bayesian model selection using encompassing priors. Statistica Neerlandica, 59(1):57–69. Kooperberg, C. (2018). polspline: Polynomial Spline Routines. R package version 1.1.13. Kooperberg, C. and Stone, C. J. (1992). Logspline density estimation for censored data. Journal of Computational and Graphical Statistics, 1(4):301–328. 20 Lunn, D. J., Thomas, A., Best, N., and Spiegelhalter, D. (2000). WinBUGS-a bayesian modelling framework: concepts, structure, and extensibility. Statistics and computing, 10(4):325–337. Masson, M. E. J. (2011). A tutorial on a practical Bayesian alternative to null-hypothesis signiﬁcance testing. Behavior Research Methods, 43(3):679–690. Morey, R. D. and Rouder, J. N. (2011). Bayes factor approaches for testing interval null hypotheses. Psychological Methods, 16(4):406–419. Morey, R. D. and Rouder, J. N. (2018). BayesFactor: Computation of Bayes Factors for Common Designs. R package version 0.9.12-4.2. Neal, R. (2011). Mcmc using hamiltonian dynamics. In Brooks, S., Gelman, A., Jones, G., and Meng, X., editors, Handbook of Markov Chain Monte Carlo, pages 116–162. Chapman and Hall/CRC. Oakes, M. (1986). Statistical Inference: A commentary for the social and behavioural sciences. John Wiley & Sons, Chicester. Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sam- pling. R Core Team (2018). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25:111–163. Richard, F. D., Bond, C. F., and Stokes-Zoota, J. J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7(4):331–363. Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., and Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2):225–237. Stone, C. J., Hansen, M. H., Kooperberg, C., and Truong, Y. K. (1997). Polynomial splines and their tensor products in extended linear modeling: 1994 Wald memorial lecture. The Annals of Statistics, 25(4):1371–1470. Wagenmakers, E., Wetzels, R., Borsboom, D., and van der Maas, H. L. J. (2011). Why psychologists must change the way they analyze their data: The case of psi: Comment on Bem (2011). Journal of Personality and Social Psychology, 100(3):426–432. Wagenmakers, E.-J., Lodewyckx, T., Kuriyal, H., and Grasman, R. (2010). Bayesian hypothesis test- ing for psychologists: A tutorial on the SavageDickey method. Cognitive Psychology, 60(3):158– Wang, M. (2017). Mixtures of g -priors for analysis of variance models with a diverging number of parameters. Bayesian Analysis, 12(2):511532. Wetzels, R., Grasman, R. P., and Wagenmakers, E.-J. (2010). An encompassing prior generalization of the savagedickey density ratio. Computational Statistics & Data Analysis, 54(9):20942102. 21 Wetzels, R., Raaijmakers, J. G. W., Jakab, E., and Wagenmakers, E.-J. (2009). How to quantify support for and against the null hypothesis: A ﬂexible WinBUGS implementation of a default bayesian t test. Psychonomic Bulletin & Review, 16(4):752–760. Zellner, A. and Siow, A. (1980). Posterior odds ratios for selected regression hypotheses. Trabajos de Estadistica Y de Investigacion Operativa, 31(1):585–603. Appendix: In this appendix, I report a simulation study designed to benchmark performance of the posterior sampling generalization of the JZS t-test described in this paper against the version of the JZS test originally proposed by Rouder et al. (2009). For each simulation, I randomly generated 200 single-sample data sets of size N (where N = 20, 50, or 80) under the model y = +" . For each of these data sets, dierent “eects” were represented i i by varying the parameter , which was assumed to be drawn randomly from a normal distribution with mean 0 and variance g (where g = 0, 0.05, or 0.2; also see Wang, 2017; Faulkenberry, 2018). Each of the errors " for a given data set was drawn from a normal distribution with mean 0 and variance 1. The resulting combinations of 3 dierent sample sizes (N = 20; 50; 80) and 3 dierent eect parameters (g = 0; 0:05; 0:2) produced a total of nine simulations. Once a simulated data set was constructed, I computed the Bayes factor B for the null hypothesis H : = 0 over the alternative hypothesisH : = 1 using two methods: (1) the JZS Bayes factor of 0 1 Rouder et al. (2009), and (2) the posterior sampling technique. The JZS Bayes factor was computed using the ttestBF function from the BayesFactor package in R, and the posterior sampling Bayes factor was computed using the methods described in this paper in Section 4.1 (i.e., drawing posterior samples using Greta, ﬁtting a logspline estimate of the posterior, and then computing the Savage- Dickey density ratio by comparing the ordinates of = 0 in the posterior and prior, respectively). Each Bayes factor was computed using a Cauchy prior of scale r = 1. In all, I found the two methods to be quite comparable to each other. To see why, let’s ﬁrst inspect the distributions of Bayes factors obtained for each of the nine combinations of N and g. Figure A.1 shows these via overlaid density plots of log(B ) for each computation method. As one can readily see in Figure A.1, the density plots have considerable overlap, indicating that both methods produced very similar distributions of Bayes factors. Further evidence for the compatibility of the two techniques comes from Table A.1, shows ﬁve- number summaries for the values of log(B ) obtained for each condition. Additionally, I computed model selection consistency, deﬁned as the proportion of simulated data sets for which the JZS Bayes factor and the posterior sampling Bayes factor led to the same model choice. For this computation, model selection was determined by computing log(B ). If log(B ) > 0, thenH was selected; other- 01 01 0 wise,H was selected (see also Faulkenberry, 2018). As is shown in Table A.1, the posterior sampling technique again produced a distribution of Bayes factors that was very similar to those obtained from the JZS Bayes factor, mirroring what is shown in Figure A.1. Critically, both computation methods selected the same model in a large percentage of the simulated data sets, as indicated by the large consistency values in Table A.1. In all, the proposed sampling method for computing Bayes factors described in this tutorial seems to be quite consistent with the established, albeit less ﬂexible, JZS Bayes factor of Rouder et al. (2009). Thus, the researcher can be conﬁdent that the posterior sampling methods described in this paper not only aord a great deal of ﬂexibility, but also benchmark well against other established methods of computation. 22 g=0, N=20 g=0, N=50 g=0, N=80 1.5 1.0 1.0 1.0 0.5 0.5 0.5 0.0 0.0 0.0 −2 −1 0 1 2 −2 −1 0 1 2 −4 −2 0 2 g=0.05, N=20 g=0.05, N=50 g=0.05, N=80 0.4 0.2 0.4 0.3 0.2 0.2 0.1 0.1 0.0 0.0 0.0 −4 −2 0 2 −8 −4 0 −6 −3 0 3 g=0.2, N=20 g=0.2, N=50 g=0.2, N=80 0.15 0.09 0.2 0.10 0.06 0.1 0.05 0.03 0.0 0.00 0.00 −20 −15 −10 −5 0 −25 −20 −15 −10 −5 0 −30 −20 −10 0 log(BF) log(BF) log(BF) Figure A.1: Density plots of log(B ) for each of the nine combinations of ”eect” g (0,0.05,0.2) and sample size N (20,50,80). The JZS Bayes factor distribution is displayed as a solid line, whereas the distribution of posterior sampling Bayes factors is displayed as a dashed line. Table A.1: Five-number summary of log(B ) and model selection consistency. g N BF type Min Q Median Q Max Consistency 1 3 0 20 JZS -1.77 -1.71 -1.51 -1.12 1.76 sampling -1.95 -1.70 -1.51 -1.08 1.88 0.985 50 JZS -2.20 -2.16 -2.01 -1.56 1.51 sampling -2.47 -2.16 -1.99 -1.53 2.49 0.985 80 JZS -2.43 -2.38 -2.20 -1.69 4.21 sampling -2.55 -2.36 -2.18 -1.61 2.90 1.000 0.05 20 JZS -4.43 0.23 1.11 1.61 1.77 sampling -4.95 0.27 1.12 1.62 2.04 0.995 50 JZS -10.86 0.08 1.48 1.97 2.20 sampling -4.95 0.27 1.12 1.62 2.04 0.975 80 JZS -8.09 -0.79 1.14 2.06 2.43 sampling -6.96 -0.79 1.16 2.08 2.51 0.945 0.2 20 JZS -10.98 -1.76 0.39 1.44 1.77 sampling -19.90 -1.81 0.25 1.42 1.86 0.975 50 JZS -23.85 -4.25 0.05 1.51 2.20 sampling -12.39 -3.19 0.01 1.49 2.36 0.965 80 JZS -37.79 -6.78 -1.13 1.67 2.43 sampling -25.60 -4.77 -1.11 1.85 2.65 1.000 density density density
Statistics – arXiv (Cornell University)
Published: Dec 7, 2018
You can share this free article with as many people as you like with the url below! We hope you enjoy this feature!
Read and print from thousands of top scholarly journals.
Already have an account? Log in
Bookmark this article. You can see your Bookmarks on your DeepDyve Library.
To save an article, log in first, or sign up for a DeepDyve account if you don’t already have one.
Copy and paste the desired citation format or use the link below to download a file formatted for EndNote
Access the full text.
Sign up today, get DeepDyve free for 14 days.
All DeepDyve websites use cookies to improve your online experience. They were placed on your computer when you launched this website. You can change your cookie settings through your browser.