A Comparison of Methods for Adaptive Experimentation
Horn, Samantha; Sloman, Sabina J.
2022-07-01 00:00:00
We use a simulation study to compare three methods for adaptive experimentation: Thompson sampling, Tempered Thompson sampling, and Exploration sampling. We gauge the performance of each in terms of social welfare and estimation accuracy, and as a function of the number of experimental waves. We further construct a set of novel "hybrid" loss measures to identify which methods are optimal for researchers pursuing a combination of experimental aims. Our main results are: 1) the relative performance of Thompson sampling depends on the number of experimental waves, 2) Tempered Thompson sampling uniquely distributes losses across multiple experimental aims, and 3) in most cases, Exploration sampling performs similarly to random assignment.

Keywords: adaptive experimentation, response-adaptive randomization

Adaptive experiments have recently gained popularity in the social sciences. While traditional methods for adaptive experimentation target participant welfare, a body of literature shows that these methods forgo statistical power and can introduce bias in the estimation of the efficacy of some interventions. We compare three methods for adaptive experimentation, namely Thompson sampling (Thompson, 1933), Exploration sampling (Kasy and Sautmann, 2021) and Tempered Thompson sampling (Caria et al., 2020), and investigate their relative performance as a function of the number of experimental waves, and with respect to a diverse set of base and hybrid loss measures, corresponding, respectively, to singular and dual experimental aims.

1. Problem Setup and Background

Consider an experimenter who has access to a population of $N$ experimental participants, each of whom participates in one of $T$ experimental waves, indexed by $t \in \{1, \dots, T\}$. $N_t$ refers to the number of participants who participate in wave $t$. We index each participant by $i \in \{1, \dots, N_t\}$. For each participant $i$ at time $t$, the experimenter observes an outcome $Y_{i,t} \in \{0, 1\}$, with 1 indicating the participant experienced a desirable outcome and 0 indicating the absence of that outcome.

† Joint first authors.
Footnote 1. Other literature refers to similar methods as response-adaptive randomization.
Footnote 2. See, e.g., Trippa et al. (2012), Wason and Trippa (2014), Lin and Bunn (2017), Wathen and Thall (2017), Viele et al. (2020), Ryan et al. (2020) and Kaibel and Biemann (2021).
Footnote 3. In our simulations, where $N$ is always evenly divisible by $T$, $N_t = N/T \;\; \forall t$.

arXiv:2207.00683v1 [stat.ME] 1 Jul 2022

Each participant $i$ at time $t$ is assigned to one of a fixed set of treatments, or interventions, $D_{i,t} \in \mathcal{D}$ where $|\mathcal{D}| = K$. The outcome conditional on reception of treatment $D_k$ is assumed to follow a Bernoulli($\theta_k$) distribution. $\theta_k$ is the average potential outcome corresponding to treatment $D_k$. The number of participants assigned to $D_k$ at time $t$ is denoted $n_{t,k}$.

The experimenter starts with a prior distribution on the average potential outcome of each treatment $D_k$. After each wave $t$, they use Bayesian inference to update this distribution based on the observed outcomes. $p(\theta_k)$ denotes the posterior probability of $\theta_k$.

We use $k^*$ to index the treatment with the highest average potential outcome (unknown to the experimenter). $\hat{k}$ indexes the treatment with the highest estimated average potential outcome at the end of the experiment, i.e.,

$$\hat{k} \equiv \operatorname*{argmax}_{k \in \{1, \dots, K\}} \int \theta_k \, p(\theta_k) \, d\theta_k .$$

In practice, this can be thought of as the treatment deemed most likely to be effective based on the data collected, and perhaps implemented as policy.

2. Description of Assignment Mechanisms

Each adaptive experimentation method, or assignment mechanism, we evaluate differs in how $n_{t,k}$, the number of participants assigned to each treatment $D_k$ at wave $t$, is determined.
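To make the setup concrete, the following is a minimal sketch of how the Beta-Bernoulli posterior from Section 1 could be updated after a wave and then turned into the assignment probabilities that this section goes on to define. It is an illustration under stated assumptions, not the authors' replication code: the function names are hypothetical, the Thompson probabilities $P(k = k^*)$ are approximated by Monte Carlo draws, the tempered rule is assumed to put weight `gamma` on the uniform $1/K$ component, and the fallback used when every exploration weight is zero is our own choice.

```python
import numpy as np

def update_posterior(alpha, beta, successes, failures):
    """Conjugate Beta-Bernoulli update, applied arm by arm."""
    return alpha + successes, beta + failures

def thompson_probs(alpha, beta, n_draws=20_000, rng=None):
    """Monte Carlo estimate of P(k = k*) under the current Beta posteriors."""
    rng = np.random.default_rng(rng)
    draws = rng.beta(alpha, beta, size=(n_draws, len(alpha)))
    best = draws.argmax(axis=1)  # index of the best arm in each posterior draw
    return np.bincount(best, minlength=len(alpha)) / n_draws

def exploration_probs(p):
    """Exploration sampling: weights proportional to p * (1 - p)."""
    w = p * (1.0 - p)
    total = w.sum()
    # Assumption: if one arm has probability 1, all weights vanish; keep p.
    return w / total if total > 0 else p

def tempered_probs(p, gamma=0.2):
    """Assumed tempered rule: mix Thompson probabilities with uniform 1/K."""
    return (1.0 - gamma) * p + gamma / len(p)

# Illustrative data for one wave with K = 3 arms and Beta(1, 1) priors.
alpha, beta = update_posterior(
    np.ones(3), np.ones(3),
    successes=np.array([3, 5, 8]), failures=np.array([7, 5, 2]),
)
p_thompson = thompson_probs(alpha, beta, rng=0)
print("Thompson:   ", p_thompson)
print("Exploration:", exploration_probs(p_thompson))
print("Tempered:   ", tempered_probs(p_thompson))

# The arm with the highest posterior mean plays the role of k-hat.
k_hat = int(np.argmax(alpha / (alpha + beta)))
print("k_hat:", k_hat)
```

Random assignment corresponds to using `np.full(K, 1 / K)` in every wave. In this sketch the only state carried between waves is the pair of Beta parameters per arm, which is why the posterior update is kept separate from the assignment rules.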
We compare all assignment mechanisms to the baseline of random assignment (RA), in which the probability of assignment to each treatment is constant across waves and is simply $1/K$.

When using Thompson sampling (Thompson, 1933), the probability of assignment to treatment group $k$ in experimental wave $t$ is:

$$p^{\text{thompson}}_{t,k} = P(k = k^*)$$

Exploration sampling (Kasy and Sautmann, 2021) provides a slight modification to Thompson sampling and is designed to increase power for rejecting suboptimal treatments. This is achieved by modifying the assignment probabilities as follows:

$$p^{\text{exploration}}_{t,k} = \frac{p^{\text{thompson}}_{t,k} \, (1 - p^{\text{thompson}}_{t,k})}{\sum_{k'} p^{\text{thompson}}_{t,k'} \, (1 - p^{\text{thompson}}_{t,k'})}$$

Tempered Thompson sampling is a method intended to strike a balance between painting an overall picture of the effectiveness of each treatment and minimizing in-sample regret (Caria et al., 2020). It assigns participants to arm $k$ in proportion to a weighted average of $1/K$ (the assignment probability under RA) and $p^{\text{thompson}}_{t,k}$. In other words, the probability of assignment to treatment group $k$ in experimental wave $t$ is:

$$p^{\text{tempered}}_{t,k} = (1 - \gamma) \, p^{\text{thompson}}_{t,k} + \frac{\gamma}{K}$$

where $\gamma \in [0, 1]$ allows researchers a degree of freedom in how much weight is placed on the Thompson assignment probabilities. $\gamma$ can also be thought of as controlling how much the sampling process targets regret minimization over estimation accuracy.

Footnote 4. In our simulations, we set $\gamma = 0.2$.

Table 1: Loss measures.

Class | Description | Notation | Calculation
--- | --- | --- | ---
Regret | In-sample regret | $R_{\text{sample}}$ | $\frac{1}{N} \sum_{t=1}^{T} \sum_{i=1}^{N_t} \Delta_{D_{i,t}}$
Regret | Policy regret | $R_{\text{policy}}$ | $\Delta_{D_{\hat{k}}}$
Estimation precision | RMSE of $\theta_{\hat{k}}$ | $PREC_{\text{best}}$ | $RMSE_{\hat{k}}$
Estimation precision | Average RMSE | $PREC_{\text{avg}}$ | $\frac{1}{K} \sum_{k=1}^{K} RMSE_k$
Statistical power | Fails to order treatments by $\theta$ | $SP$ | $1 - \mathbb{I}\left[ P(\hat{\theta}_{(k)} > \hat{\theta}_{(k-1)}) < \alpha \;\; \forall k \in \{2, \dots, K\} \right]$

3. Experimental Setup

Each of our simulated experiments tested three "treatments," each with a true average potential outcome drawn from a standard uniform distribution. For each set of three treatments, we ran experiments using each of the four assignment mechanisms described above at each of three levels of $N_t$: $N_t \in \{4, 10, 100\}$. For each experiment we fixed the total population size $N$ at 1,000, in effect predetermining the number of experimental waves, $T \in \{250, 100, 10\}$. We thus ran 4 assignment mechanisms × 3 levels of $N_t$ × 10,000 sets of treatments = 120,000 experiments in total. At the beginning of each experiment, we began with an uninformative Beta(1, 1) prior for each of $\theta_1$, $\theta_2$ and $\theta_3$.

Footnote 5. Replication code available at https://github.com/sami-horn/adaptive-experimentation.

Loss measures. For each experiment, we analyze its performance with respect to several loss measures, each of which corresponds to a potential experimental goal. Table 1 summarizes the three classes of loss measures we consider: measures of regret, estimation precision and statistical power.

Regret-based measures rely on the regret associated with a particular treatment $D_k$. Regret measures the amount of welfare lost compared to what would have been obtained had all participants been assigned to $D_{k^*}$. Formally, it is defined as

$$\Delta_{D_k} \equiv \theta_{k^*} - \theta_k$$

where $\theta_k$ indicates the true (in practice, unknowable) effect of treatment $D_k$.

Precision-based measures rely on the root mean-squared error (RMSE) of the posterior distribution of the average potential outcome associated with a particular $D_k$:

$$RMSE_k \equiv \sqrt{\int (\hat{\theta}_k - \theta_k)^2 \, p(\hat{\theta}_k) \, d\hat{\theta}_k}$$

where $\hat{\theta}_k$ ranges over the posterior distribution of the average potential outcome and $\theta_k$ is its true value.

Our power-based measure determines whether the study was able to identify the correct ordering of arms based on their true average potential outcomes. It measures the ability of a series of statistical tests with controlled Type-I error to recover the true rank order of $\theta_1$, $\theta_2$ and $\theta_3$.

Hybrid loss measures are pairwise combinations of the "base" loss measures described above. For example, a hybrid of $R_{\text{sample}}$ and $PREC_{\text{avg}}$ (denoted by $R_{\text{sample}}/PREC_{\text{avg}}$) would represent the dual goal of maximizing both social welfare in the participant sample and the precision of the estimated average potential outcomes. Because the regret- and precision-based measures are computed on the same scale (each corresponds to the magnitude of a difference between two average potential outcomes and is lower-bounded by 0 and upper-bounded by 1), for hybrid loss measures that combine a regret measure with a precision measure we simply take the average of the two measures. For hybrid loss measures that combine a regret- or precision-based measure $L$ with $SP$, we take the maximum value of the two measures. This equals the value of $L$ when the correct ordering is identified ($SP = 0$); otherwise, the maximum loss of 1 is incurred. This can be interpreted similarly to a constrained objective, in which the "constraint" is that the correct ordering is identified.

4. Results

Base loss measures. Panels A and B of Figure 1 show performance on the two regret-based measures. $R_{\text{sample}}$ is minimized by Thompson sampling regardless of the number of experimental waves. $R_{\text{policy}}$ is generally imprecisely measured and very low, suggesting that all methods usually identify the best treatment arm.
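For readers who want to see the arithmetic behind the regret and hybrid measures from Section 3, the following sketch computes them for a single simulated experiment. It is a hedged illustration rather than the authors' implementation: the function names, the example values of the true average potential outcomes, and the representation of assignments as a flat array of arm indices are all assumptions made here.

```python
import numpy as np

def in_sample_regret(theta, assignments):
    """Mean regret over all participants, given each one's assigned arm index."""
    theta = np.asarray(theta, dtype=float)
    delta = theta.max() - theta  # regret of each arm relative to the best arm
    return float(delta[np.asarray(assignments)].mean())

def policy_regret(theta, k_hat):
    """Regret of the arm selected at the end of the experiment."""
    theta = np.asarray(theta, dtype=float)
    return float(theta.max() - theta[k_hat])

def hybrid_average(loss_a, loss_b):
    """Hybrid of a regret measure and a precision measure: their average."""
    return 0.5 * (loss_a + loss_b)

def hybrid_with_power(loss, sp):
    """Hybrid of a regret or precision measure with SP (0 or 1): the maximum."""
    return max(loss, sp)

# Hypothetical experiment: three arms, four participants.
theta = [0.2, 0.5, 0.7]
assignments = [0, 1, 2, 2]

r_sample = in_sample_regret(theta, assignments)
r_policy = policy_regret(theta, k_hat=2)
print("R_sample:", r_sample)
print("R_policy:", r_policy)
print("R_sample/R_policy:", hybrid_average(r_sample, r_policy))
print("R_sample/SP, ordering recovered:", hybrid_with_power(r_sample, 0))
print("R_sample/SP, ordering missed:   ", hybrid_with_power(r_sample, 1))
```

Taking the maximum with $SP$ means a missed ordering always costs the full loss of 1, regardless of how small the paired regret or precision loss is, which is what gives these hybrids their constraint-like reading.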
Panels C and D show how each method performs on the two precision-based measures. Thompson sampling results in higher $PREC_{\text{avg}}$ (i.e., a larger average RMSE) than other methods, and the $PREC_{\text{avg}}$ values associated with Thompson sampling increase dramatically with the number of experimental waves. The pattern of results for $PREC_{\text{best}}$ is similar to $R_{\text{sample}}$.

Finally, Panel E plots performance for $SP$. This resembles the patterns shown in Panel C, which reflects that both $PREC_{\text{avg}}$ and $SP$ require precise estimation of the average potential outcomes associated with all three treatments. However, Exploration sampling consistently outperforms RA on $SP$.

Overall, Tempered Thompson sampling performs similarly to or better than Thompson sampling, without exhibiting large variation in performance by the number of experimental waves.

Footnote 6. In our empirical results, we fix the Type-I error for each pairwise hypothesis test to .05, and use Monte Carlo draws from each $p(\theta_k)$ to generate empirical p-values.
Footnote 7. In the case of the precision measures, this is the expectation of a difference with respect to the posterior distribution of the average potential outcome.

Figure 1: Average performance on loss measures as a function of number of experimental waves. See Section 3 for details on loss measures. Error bars represent 95% confidence intervals.

Hybrid loss measures. Figure 2 shows the loss-minimizing assignment mechanism for each possible hybrid measure. To identify the "loss-minimizing" mechanism, we computed the hybrid loss achieved by each assignment mechanism on each experiment, and identified the mechanism which achieved the lowest loss on the greatest number of trials.

When the number of experimental waves is small, Thompson sampling most often minimizes loss according to almost every measure, outperformed by RA on only $PREC_{\text{avg}}$, $PREC_{\text{avg}}/PREC_{\text{best}}$, $PREC_{\text{avg}}/R_{\text{policy}}$ and $PREC_{\text{avg}}/SP$, all of which require accurate estimation of the average potential outcomes of all treatment arms.

However, this seemingly near-universal benefit of Thompson sampling does not persist in the case of large numbers of experimental waves. In these cases, Thompson sampling performs well for pairwise combinations of $R_{\text{sample}}$, $PREC_{\text{best}}$ and $R_{\text{policy}}$. In a complementary pattern, Exploration sampling and RA perform well for pairwise combinations of $R_{\text{policy}}$, $PREC_{\text{avg}}$ and $SP$.
Further inspection showed that, with the exception of $SP/R_{\text{policy}}$ and $SP$, Exploration sampling and RA perform similarly on all of these measures, highlighting the ability of both to accurately estimate the average potential outcomes of all treatments.

Footnote 8. We ran a similar analysis treating the loss-minimizer as the mechanism achieving the lowest average loss across experiments. Those results differ from those shown here in two notable ways: 1) Panel A resembles Panels B and C, i.e., Thompson sampling's advantages when there are few experimental waves are not apparent, and 2) Thompson sampling is never selected as the loss-minimizer for $R_{\text{policy}}$ (as shown in Panel B of Figure 1, on this measure, Thompson sampling is outperformed at all levels of $N_t$).
Footnote 9. We discuss Exploration sampling's persistent advantage with respect to $SP$ above; since the values of $R_{\text{policy}}$ are so small, $SP/R_{\text{policy}}$ is usually dominated by $SP$.

Figure 2: The assignment mechanism that most often minimizes each of the hybrid loss measures (the diagonal indicates the loss-minimizing assignment mechanism for each base loss measure). Numbers indicate the proportion of simulations on which the indicated assignment mechanism had the lowest corresponding loss.

Our results suggest that Tempered Thompson sampling is best when the objective requires both over-sampling from the best treatment ($R_{\text{sample}}$ and $PREC_{\text{best}}$) and precise estimates for all treatment arms ($SP$ and $PREC_{\text{avg}}$). Notably, Tempered Thompson sampling does not excel at minimizing any base measure in isolation; its comparative advantage stems from its ability to distribute losses across dual experimental aims. This reflects the fact that Tempered Thompson sampling is constructed as a blend of two other assignment mechanisms, Thompson sampling and RA, with the explicit aim of striking a balance between the benefits of both (see Section 2).

5. Discussion

We evaluated three methods for adaptive experimentation with respect to a set of base and hybrid loss measures. We found that 1) the relative performance of Thompson sampling depends on how participants are distributed across experimental waves, 2) Exploration sampling maximizes statistical power to discriminate between treatment arms (Kasy and Sautmann, 2021), and 3) Tempered Thompson sampling balances overall statistical power with an understanding of the apparently best treatment (Caria et al., 2020).

While our hybrid loss measures represent one way of constructing a quantitative trade-off between dual experimental aims, more practically useful measures would attribute weight to different aims in a way that more closely reflects the objectives of a particular researcher or problem domain. Construction of such application-specific measures is an important next step for future work.

Acknowledgments

We would like to acknowledge support for this work from the Center for Machine Learning and Health (CMLH) at Carnegie Mellon University. SJS was supported by a Tata Consultancy Services (TCS) Fellowship while contributing to this work.

References

Stefano Caria, Maximilian Kasy, Simon Quinn, Soha Shami, Alex Teytelboym, et al. An adaptive targeted field experiment: Job search assistance for refugees in Jordan. 2020.

Chris Kaibel and Torsten Biemann. Rethinking the gold standard with multi-armed bandits: Machine learning allocation algorithms for experiments. Organizational Research Methods, 24(1):78–103, 2021.

Maximilian Kasy and Anja Sautmann. Adaptive treatment assignment in experiments for policy choice. Econometrica, 89(1):113–132, 2021.

Jianchang Lin and Veronica Bunn. Comparison of multi-arm multi-stage design and adaptive randomization in platform clinical trials. Contemporary Clinical Trials, 54:48–59, 2017.

Elizabeth G Ryan, Sarah E Lamb, Esther Williamson, and Simon Gates. Bayesian adaptive designs for multi-arm trials: an orthopaedic case study. Trials, 21(1):1–16, 2020.

William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.

Lorenzo Trippa, Eudocia Q Lee, Patrick Y Wen, Tracy T Batchelor, Timothy Cloughesy, Giovanni Parmigiani, and Brian M Alexander. Bayesian adaptive randomized trial design for patients with recurrent glioblastoma. Journal of Clinical Oncology, 30(26):3258, 2012.

Kert Viele, Kristine Broglio, Anna McGlothlin, and Benjamin R Saville. Comparison of methods for control allocation in multiple arm studies using response adaptive randomization. Clinical Trials, 17(1):52–60, 2020.

James MS Wason and Lorenzo Trippa. A comparison of Bayesian adaptive randomization and multi-stage designs for multi-arm clinical trials. Statistics in Medicine, 33(13):2206–2221, 2014.

J Kyle Wathen and Peter F Thall. A simulation study of outcome adaptive randomization in multi-arm clinical trials. Clinical Trials, 14(5):432–440, 2017.