Development and Testing of an Abbreviated Numeracy Scale: A Rasch Analysis Approach

Journal of Behavioral Decision Making, J. Behav. Dec. Making, 26: 198–212 (2013)
Published online 15 March 2012 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/bdm.1751

JOSHUA A. WELLER (1)*, NATHAN F. DIECKMANN (1), MARTIN TUSLER (2), C. K. MERTZ (1), WILLIAM J. BURNS (1) and ELLEN PETERS (2)
(1) Decision Research, Eugene, OR, USA
(2) Department of Psychology, The Ohio State University, Columbus, OH, USA

ABSTRACT
Research has demonstrated that individual differences in numeracy may have important consequences for decision making. In the present paper, we develop a shorter, psychometrically improved measure of numeracy, defined as the ability to understand, manipulate, and use numerical information, including probabilities. Across two large independent samples that varied widely in age and educational level, participants completed 18 items from existing numeracy measures. In Study 1, we conducted a Rasch analysis on the item pool and created an eight-item numeracy scale that assesses a broader range of difficulty than previous scales. In Study 2, we replicated this eight-item scale in a separate Rasch analysis using data from an independent sample. We also found that the new Rasch-based numeracy scale, compared with previous measures, could predict decision-making preferences obtained in past studies, supporting its predictive validity. In Study 3, we further established the predictive validity of the Rasch-based numeracy scale. Specifically, we examined the associations between numeracy and risk judgments, compared with previous scales. Overall, we found that the Rasch-based scale was a better linear predictor of risk judgments than prior measures. Moreover, this study is the first to present the psychometric properties of several popular numeracy measures across a diverse sample of ages and educational level.
We discuss the usefulness and the advantages of the new scale, which we feel can be used in a wide range of subject populations, allowing for a more clear understanding of how numeracy is associated with decision processes. Copyright © 2012 John Wiley & Sons, Ltd.

key words: numeracy; decision making; individual differences; Rasch analysis; cognitive reflection test

*Correspondence to: Joshua Weller, Decision Research, 1201 Oak Street, Suite 200, Eugene, OR 97401, USA. E-mail: [email protected]

Decision making today involves making sense of a morass of information from various sources, such as insurance companies, financial advisors, and marketers (Hibbard, Slovic, Peters, Finucane, & Tusler, 2001; Thaler & Sunstein, 2003; Woloshin, Schwartz, & Welch, 2004). Today’s consumers need an understanding of numbers and basic mathematical skills to use numerical information presented in text, tables, or charts. However, consumers differ considerably in their ability to understand and use such information (Peters, Dieckmann, Dixon, Hibbard, & Mertz, 2007). Numbers are generally provided to facilitate choices, but they can be confusing or difficult to understand and use for even the most motivated and skilled individual, and appear to be more so for those who are less skilled.

Research has demonstrated that individual differences in numeracy, the ability to comprehend and manipulate probabilistic and other numeric information, may have important consequences for decision making (Estrada, Barnes, Collins, & Byrd, 1999; Reyna, Nelson, Han, & Dieckmann, 2009). An estimate from the National Adult Literacy Survey (Educational Testing Service, 1992) suggests that approximately half of the US population has only very basic or below basic quantitative skills (Kirsch, Jungeblut, Jenkins, & Kolstad, 2002). The National Assessment of Adult Literacy (Kutner, Greenberg, Jin, & Paulsen, 2006; NCES, 2003) demonstrated similar results. In addition, these problems may be particularly acute for older adults. For example, Hibbard et al. (2001) found that a large proportion of older adults (more than half of those over age 65) had substantial difficulty using numerical information to compare Medicare health plans.

Although there are several numeracy measures available to researchers (e.g., Lipkus, Samsa, & Rimer, 2001; Peters, Dieckmann et al., 2007; Schwartz, Woloshin, Black, & Welch, 1997), the distributional characteristics of these scales previously reported suggest that the items in these measures may possess a limited range of difficulty (Cokely & Kelley, 2009; Cokely, Galesic, Schulz, Ghazal, & Garcia-Retamero, 2012). Administering a measure that does not match the range of ability level of the population of interest, which may be the case for highly numerate populations such as college students or ones that are less numerate (e.g., older adults or those with lower educational levels), potentially limits the test’s ability to discriminate ability level. Put differently, the items in the measure essentially become redundant as respondents answer all items correctly in the former case and incorrectly in the latter. Therefore, a numeracy measure with a greater range of difficulty would be desirable. In the current study, we developed such a measure by adopting an item response theory (IRT) approach. Using scaling procedures developed by Rasch (1960/1993), we created a measure of numeracy derived from existing measures shown to be related to decision-making behavior.

Existing measures of numeracy

Researchers have measured numeracy in various ways often because of differences in their specific research interests and domains of study (Reyna et al., 2009). Some scales have focused on subjective perceptions of one’s own numerical abilities (Fagerlin et al., 2007; Woloshin et al., 2004; Zikmund-Fisher, Smith, Ubel, & Fagerlin, 2007) in an attempt to measure numeracy without directly asking participants to make any mathematical computations. These scales, at the face level, appear to measure individual differences in confidence to effectively utilize numeric information in and ability to conduct mathematical operations. One subjective test, the Subjective Numeracy Scale (SNS, Fagerlin et al., 2007; Zikmund-Fisher et al., 2007), has been found to correlate with objective measures of numeracy. However, self-assessments of confidence are influenced by factors in addition to true ability level (Dunning, Heath, & Suls, 2004), leading to potential concerns about the validity of such assessments. Other numeracy measures have focused on objective performance, testing individuals’ ability to make correct computations and understand probabilistic information. These abilities are particularly important in understanding the risk and benefit information presented in many “real-world” decision-making contexts (e.g., health and financial contexts; Burkell, 2004). Although both methods to assess individual differences in numeracy provide valuable insights, the current study focuses on the objective performance scales that have been used in the literature.

Schwartz et al. (1997) developed one of the first performance-based numeracy measures. The measure was comprised of three items that included one question assessing participants’ understanding of chance (i.e., How many heads would come up in 1000 tosses of a fair coin?) and two questions asking the participants to convert a percentage to a proportion and vice versa (i.e., the chance of winning a car is 1 in 1000; what is the percentage of winning tickets for the lottery?). Lipkus et al. (2001) further expanded this scale by adding eight questions to the Schwartz et al. numeracy scale; the additional items were designed to assess a participant’s ability to understand and compare risks (e.g., Which of the following numbers represents the biggest risk of getting a disease: 1%, 10%, or 5%?) and to accurately work with decimal representations, proportions, and fractions. Moreover, Peters, Hibbard, Slovic, and Dieckmann (2007) further expanded the Lipkus et al. numeracy scale, introducing four additional items in an attempt to expand the range of difficulty; these additional items assess the understanding of base rates as well as the ability to make more complex likelihood calculations.

Similarly, Frederick (2005) developed a three-item measure, the cognitive reflection test (CRT), which includes items that involve mathematical ability. Although the CRT was not explicitly defined as a numeracy test and only speculation exists about the underlying dimensions of the CRT, the items appear to require understanding, manipulating, and using numbers to solve them. Prior research has supported this assertion. For instance, Obrecht, Chapman, and Gelman (2009) found that the CRT was moderately correlated with SAT quantitative scores (r = .45; see Toplak, West, & Stanovich, 2011 for similar findings). In a smaller study, Cokely and Kelley (2009) reported a significant (r = .31) correlation between numeracy and CRT performance. Moreover, Liberali, Reyna, Furlan, Stein, and Pardo (2011) reported a moderate to strong correlation (r = .51 and r = .40 in Brazilian and US samples, respectively; Cohen, 1992) between the 11-item Lipkus et al. (2001) scale and the CRT. Finally, in a large sample including individuals from across the adult lifespan, Finucane and Gullion (2010) also reported a similar effect size (r = .53) between the CRT and numeracy. These findings give us an a priori basis to test whether the CRT items may also serve as valid indicators of the latent construct of numeracy.

Although both the Schwartz-based numeracy scales and the CRT are predicted to be indicators of numeracy, evidence suggests that these scales may differ in their ability to assess performance at different levels of the latent trait. For instance, even in very numerate populations, such as college students from highly selective universities, a substantial proportion of participants score only 0 or 1 on the three-item CRT. Frederick (2005) reported that approximately one-third of this total sample scored 0 on the CRT and another 28% answered only one question correctly. Further, the modal score of nearly half of the sub-samples collected was 0. In contrast, median scores on the Lipkus et al. (2001) measure approach the maximum range of scores (e.g., Peters, Västfjäll, Slovic, Mertz, Mazzocco, & Dickert, 2006). The skewness of each of these measures may limit the measure’s ability to discriminate numeracy level in many populations and may provide a disadvantage when assessing any linear effects of numeracy.

Associations between individual differences in numeracy and decision making

Individual differences in numeracy have been shown to have important associations with judgment and decision making. Recent reviews of the numeracy literature have found that compared with highly numerate individuals, those lower in numeracy are more likely to have difficulty judging risks and providing consistent assessments of utility, are worse at reading graphs, show larger framing effects, and are more sensitive to the formatting of probability information (for reviews, see Peters, Hibbard et al., 2007; Reyna et al., 2009). Although numeracy typically leads to better decision making, there is evidence that the increased numerical processing observed in the highly numerate can lead to increased affective reactions to numbers, or number comparisons, which, in turn, can result in optimal or sub-optimal decision making. In an optimal example, Peters et al. (2006) asked participants to complete a ratio bias task. They were offered a chance to win a prize by drawing a red jellybean from a bowl. When provided with two bowls from which to choose, participants often elected to draw from a large bowl containing a greater absolute number, but smaller proportion, of red beans (9 in 100, 9%) rather than from a small bowl with fewer red beans but a better winning probability (1 in 10, 10%) even with the probabilities stated beneath each bowl. Peters et al. (2006) found that 33% and 5% of less and more numerate adults, respectively, chose the larger inferior bowl. Controlling for SAT scores, the choice effect remained significant. In addition, compared with the highly numerate, the less numerate reported less affective precision about Bowl A’s 9% chance (“How clear a feeling do you have about [its] goodness or badness?”); their affect to the inferior 9% odds (“How good or bad does [it] make you feel?”) was directionally less negative. Peters et al. (2006) concluded that affect derived from numbers and number comparisons may underlie the highly numerate’s greater number use (cf. the “Bets” experiment in the present paper’s Study 2 and Peters et al., 2006).

Frederick (2005) also found that individuals who performed well on his CRT were more likely to choose a future reward of greater value than a smaller immediate reward. Further, these individuals demonstrated evidence of weaker reflection effects (i.e., risk taking to avoid losses is greater than risk taking to achieve gains; Kahneman & Tversky, 1979), compared with individuals scoring low in cognitive reflection. High-CRT individuals also were less likely to show risk-averse preferences towards gambles when the relative expected value between choice options favored choosing an uncertain option. Moreover, Toplak et al. (2011) found that greater CRT performance was significantly associated with an index of rational decision making comprised of a collection of classic heuristics and biases tasks.

Development of an abbreviated numeracy scale

A common problem with traditional methods of short-form scale construction has been the reliance on item–total correlations to guide item selection for short forms (i.e., choosing items with the highest item–total correlations). Using such an approach renders the researcher unable to ascertain whether the short form has removed error variance or narrowed the construct (Smith, McCarthy, & Anderson, 2000). In turn, scales developed in this manner are often less able to fully assess the scope of the construct in question, thus posing a threat to predictive validity of the measure despite retaining levels of internal consistency similar to the long form (Smith & McCarthy, 1996).

Alternative scaling methods can allay such concerns. Using these techniques, which can be classified as IRT-based scaling, one can develop more efficient psychological tests, in the sense that fewer items are needed to measure a latent construct while concurrently maintaining the scale’s range of difficulty. Importantly, these methods largely preserve psychometric indices such as mean inter-item correlations despite reductions in the number of items, upon which calculations of coefficient α are based.

Footnote: For further reading regarding IRT-based approaches versus a classical test theory approach, see Lord and Novick (1968) and Embretson (1996).

One IRT-based scaling approach was developed by Rasch (1960/1993) and has been successfully used to develop shorter instruments for a wide range of constructs (e.g., Cole, Kaufman, Smith, & Rabin, 2004; Hibbard, Mahoney, Stockard, & Tusler, 2005; Prieto, Alonso, & Lamarca, 2003; Simon, Ludman, Bauer, Unützer, & Operskalski, 2006). In a Rasch model, responses are viewed as outcomes of the interaction between a test taker’s standing on a latent trait or ability level and the difficulty of the test item. According to this model, the probability that an individual will correctly answer an item is a logistic function of the difference between the individual’s trait level and the extent to which the trait is expressed in the item. Put differently, the higher a person’s ability relative to the difficulty of an item, the higher the probability of a correct response on that item. When a person’s location on the latent trait is equal to the difficulty of the item, there is, by definition, a .5 probability of a correct response in the Rasch model. Thus, for each item, Rasch analyses can characterize a curve that describes the ability level at which the item maximally discriminates.

Overview of the present paper

In Study 1, we focused on the development of a Rasch-based numeracy measure. For our item pool, we used items from the existing scales: the Schwartz et al. (1997) three-item measure, the Lipkus et al. (2001) expanded 11-item numeracy scale, further expansion of that scale by Peters, Hibbard et al. (2007), and Frederick’s (2005) CRT. In contrast to a typical short-form scale construction that attempts to reduce a single existing scale, our primary objective was to retain the range of difficulty shown across the scales and to develop a shorter numeracy measure (relative to the entire item pool and to individual measures as possible). The former point will allow a broader use of the scale for populations who show limited variability on the existing measures. To achieve these goals, we incorporated items from all four measures that encompass a greater range of difficulty than any one of the scales. In Study 2, we confirmed the Rasch analysis results on an independent sample and tested the predictive validity of the scale by replicating findings that have been obtained in previous studies. Additionally, we compared the predictive validity of our scale with that of the CRT and the Lipkus et al. measure. Finally, in Study 3, we further tested the predictive and comparative validity of the Rasch-based numeracy scale by examining its associations with risk likelihood judgments.

STUDY 1

Method

Participants

Participants were 1970 subjects collected from three separate samples. The first sample consisted of 302 community members, equally divided between those with higher education and those with lower education. Participants were recruited through online and newspaper advertisements. The second sample consisted of 163 undergraduates in an introductory psychology class. Finally, the third sample was an online study of adults using the American Life Panel (n = 1505). These three samples were merged into a single dataset. The sample included 894 women (45.3%) and 1076 men (54.7%). The median age for this sample was 48 years (range = 18–89). Highest educational level attained was as follows: 3% of participants did not graduate from high school, 16.3% received a high school diploma, 9.2% attended a vocational/trade school or community college, 31.7% had completed some college (including those currently enrolled in a 4-year program), 21.5% received a bachelor’s degree, and 17.5% had an advanced degree. The college sample received course credit for their participation, and individuals in both community samples were financially compensated for participation.

Numeracy scales

All participants completed the following measures of numeracy: the 11-item Lipkus numeracy scale (Lipkus et al., 2001), which also included the three items from Schwartz et al. (1997), four additional items developed by Peters, Hibbard et al. (2007), and three CRT items (Frederick, 2005).

Results and discussion

Numeracy scales

For the Schwartz et al. three-item scale, Cronbach’s α = .58, mean inter-item r = .31. Adding the additional eight items of Lipkus et al. to the Schwartz et al. scale resulted in the 11-item Lipkus numeracy measure with Cronbach’s α = .76, mean inter-item r = .23. When adding the four additional items of Peters et al. to the Lipkus measure, Cronbach’s α = .76, mean inter-item r = .19. For the CRT, Cronbach’s α = .60, mean inter-item r = .34. In the current sample, the Peters, Hibbard et al. (2007) and CRT measures were significantly correlated (r = .49). Further, examination of Cronbach’s α of the omnibus 18-item scale (α = .75) and the mean inter-item correlation (r = .19) for the combined items provides initial evidence that the decision to combine these scales was warranted.

Confirmatory factor analysis

Because Rasch analysis assumes that the latent construct is unitary in nature, the most important threat to this assumption would occur if the CRT and the items from the other numeracy scales represented separate factors. Such a finding would suggest that the item pool that we intended to use would not tenably represent a coherent, unitary construct. To test whether the CRT and numeracy items load on a unitary factor, we compared two separate confirmatory factor analysis (CFA) models: (i) a single-factor model in which all numeracy and CRT items loaded on a unitary factor and (ii) a correlated two-factor model with CRT items loading on one dimension and numeracy items loading on another factor. CFA is widely regarded in the broader psychological assessment literature to be the strongest test for unidimensionality, compared with exploratory factor analysis methods. CFAs were conducted using MPLUS version 6.1 software. A variance-adjusted weighted least squares estimation was used to estimate dichotomous variables in CFA. Path parameters were freely estimated. Both the one-factor and two-factor solutions showed nearly identical fit statistics (see Table 1 for fit statistics and factor loadings). Given that the two-factor model does not offer an appreciably better model fit and the between-factor correlation was high (r = .85), the more parsimonious explanation of the data favors adopting a one-factor model. The data suggest that the assumption that the item pool represented a coherent, unitary construct is a tenable one; hence, Rasch-based scaling is appropriate.

Footnote: From the inter-item correlation matrix, we chose to omit two items from these analyses. We chose to omit question 8a because of its strong redundancy with item 8b (r = .78), compared with its correlation with other items. We also conducted a CFA with item 8a instead of 8b, and these findings did not appreciably differ from those reported. Further, we chose to omit question 14 (SARS item) because it showed no significant associations with other items in the item pool at p < .05.

Table 1. Fit statistics and unstandardized and standardized coefficients for one-factor and two-factor confirmatory factor analysis solutions, Study 1

Item number | One-factor solution: Ustd (SE), Std (SE); two-factor solution (Factor 1, Factor 2): Ustd (SE), Std (SE)
Q1. Imagine that we roll a fair, six-sided die 1000 times. Out of 1000 rolls, how many times do you think the die would come up as an even number? | 1.0 (.00) 1.0 (.00) .67 (.02) .64 (.02)
Q2. In the BIG BUCKS LOTTERY, the chances of winning a $10.00 prize are 1%. What is your best guess about how many people would win a $10.00 prize if 1000 people each buy a single ticket from BIG BUCKS? | 1.10 (.05) .70 (.02) 1.1 (.05) .70 (.02)
Q3. In the ACME PUBLISHING SWEEPSTAKES, the chance of winning a car is 1 in 1000. What percent of tickets of ACME PUBLISHING SWEEPSTAKES win a car? | 1.17 (.05) .76 (.02) 1.18 (.05) .77 (.02)
Q4. Which of the following numbers represents the biggest risk of getting a disease? (1 in 100, 1 in 1000, or 1 in 10) | 1.13 (.06) .73 (.03) 1.12 (.06) .73 (.03)
Q5. Which of the following numbers represents the biggest risk of getting a disease? (1%, 10%, or 5%) | 1.07 (.06) .69 (.03) 1.07 (.06) .69 (.03)
Q6. If Person A’s risk of getting a disease is 1% in 10 years, and Person B’s risk is double that of A’s, what is B’s risk? | 1.16 (.05) .75 (.02) 1.17 (.05) .76 (.02)
Q7. If Person A’s chance of getting a disease is 1 in 100 in 10 years, and person B’s risk is double that of A, what is B’s risk? | 1.11 (.05) .72 (.02) 1.12 (.05) .72 (.02)
Q8b. Out of 1000? | .92 (.06) .60 (.03) .92 (.06) .60 (.03)
Q9. If the chance of getting a disease is 20 out of 100, this would be the same as having a _____% chance of getting the disease. | 1.03 (.05) .67 (.03) 1.03 (.05) .67 (.03)
Q10. The chance of getting a viral infection is .0005. Out of 10 000 people, about how many of them are expected to get infected? | .77 (.05) .49 (.03) .77 (.05) .50 (.03)
Q11. Which of the following numbers represents the biggest risk of getting a disease? (1 in 12 or 1 in 37) | 1.14 (.07) .74 (.04) 1.14 (.07) .74 (.04)
Q12. Suppose you have a close friend who has a lump in her breast and must have a mammography ... The table below summarizes all of this information. Imagine that your friend tests positive (as if she had a tumor), what is the likelihood that she actually has a tumor? | .74 (.07) .48 (.04) .74 (.07) .48 (.04)
Q13. Imagine that you are taking a class and your chances of being asked a question in class are 1% during the first week of class and double each week thereafter (i.e., you would have a 2% chance in Week 2, a 4% chance in Week 3, an 8% chance in Week 4). What is the probability that you will be asked a question in class during Week 7? | 1.05 (.05) .67 (.02) 1.05 (.05) .68 (.02)
Q15 (CRT). A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost? | 1.20 (.05) .77 (.02) 1.16 (.05) .85 (.02)
Q16 (CRT). If it takes five machines 5 minutes to make five widgets, how long would it take 100 machines to make 100 widgets? | 1.06 (.05) .68 (.02) 1.0 (.00) .74 (.03)
Q17 (CRT). In a lake, there is a patch of lily pads. Every day, the patch doubles in size. If it takes 48 days for the patch to cover the entire lake, how long would it take for the patch to cover half of the lake? | .91 (.05) .58 (.03) .87 (.05) .64 (.03)

Fit statistic | One-factor solution | Two-factor solution
χ²/df | 9.980 | 9.628
CFI | .912 | .917
TLI | .900 | .903
RMSEA | .068 | .066

Note. Standard errors are reported in parentheses. CFI, comparative fit index; RMSEA, root mean square error of approximation; SE, standard error; TLI, Tucker–Lewis index.

Rasch analysis

Table 2 shows the item difficulty statistics for all items (i.e., the proportion of participants correctly answering each item). On average, the Lipkus et al. numeracy items were less difficult, whereas the CRT items were more difficult. Next, we conducted a Rasch analysis on all numeracy and CRT items, following the procedure of Hibbard et al. (2005). Initially, items were assessed for fit. In general, fit statistics should range from .5 to 1.5 (Linacre, 2002). One item was deleted because of a poor outfit statistic. All other items met this criterion. To reduce the item pool further, items were deleted sequentially on the basis of the extent to which the deletion minimally reduced the person reliability. Person reliability is a measure of the ability of the scale to discriminate the sample into different levels of ability and, therefore, is a key construct in measure development using the Rasch technique. After each item was deleted, Rasch analysis was rerun to determine the decrease in person reliability for that deletion. The item that decreased person reliability the least was deleted, and the process was repeated. In the case of ties, items that were most similar to remaining items in difficulty were deleted. The process was stopped when further deletions resulted in unacceptably low levels of person reliability (Hibbard et al., 2005).

The final scale consisted of eight items, five from the original Lipkus et al. scale (including the three original Schwartz et al. items), two from the CRT scale, and one of the Peters et al. items. Difficulty structure and fit statistics are shown in Table 3. Fit statistics for all items were deemed to be adequate, and person reliability was .63. Cronbach’s α for the eight-item scale was .71 and mean inter-item was r = .24. Consistent with the psychological assessment literature, which suggests that the mean inter-item correlation is a more useful index of internal consistency, the observed mean inter-item correlation was acceptable for measuring a broad, higher-order construct (Briggs & Cheek, 1986; Clark & Watson, 1995). Combined with the CFA results, these results suggest that the Rasch-based numeracy scale measures the construct in a coherent, unitary, and internally consistent manner.

Footnote: As suggested by Cortina (1993), we calculated the index of a precision estimate that estimates the “spread” or standard error of α. Although larger values of this estimate cannot definitively state that multidimensionality is present, higher standard errors are a symptom of multidimensionality. Conversely, an estimate = 0 would suggest unidimensionality. For the reduced eight-item scale, the precision estimate = .01. For comparison purposes, we created a hypothetical scale with the same number of items that included two orthogonal dimensions, maintaining a roughly equivalent α and mean inter-item correlation to that of our scale (α = .72 and r = .246). For the hypothetical scale, the precision estimate = .06. These findings would suggest that the spread of the inter-item correlations more closely resembles a unitary scale rather than a multidimensional scale.

Table 2. Item difficulties for individual items, Study 1

Item | Item difficulty
Q11. Which of the following numbers represents the biggest risk of getting a disease? (1 in 12 or 1 in 37) | 96.1
Q5. Which of the following numbers represents the biggest risk of getting a disease? (1%, 10%, or 5%) | 94.5
Q4. Which of the following numbers represents the biggest risk of getting a disease? (1 in 100, 1 in 1000, or 1 in 10) | 92.7
Q8a. If the chance of getting a disease is 10%, how many people would be expected to get the disease? Out of 100? | 91.2
Q8b. Out of 1000? | 88.1
Q9. If the chance of getting a disease is 20 out of 100, this would be the same as having a _____% chance of getting the disease. | 84.3
Q1. Imagine that we roll a fair, six-sided die 1000 times. Out of 1000 rolls, how many times do you think the die would come up as an even number? | 74.9
Q13. Imagine that you are taking a class and your chances of being asked a question in class are 1% during the first week of class and double each week thereafter (i.e., you would have a 2% chance in Week 2, a 4% chance in Week 3, an 8% chance in Week 4). What is the probability that you will be asked a question in class during Week 7? | 74.3
Q6. If Person A’s risk of getting a disease is 1% in 10 years, and Person B’s risk is double that of A’s, what is B’s risk? | 71.2
Q2. In the BIG BUCKS LOTTERY, the chances of winning a $10.00 prize are 1%. What is your best guess about how many people would win a $10.00 prize if 1000 people each buy a single ticket from BIG BUCKS? | 70.6
Q10. The chance of getting a viral infection is .0005. Out of 10 000 people, about how many of them are expected to get infected? | 58.4
Q7. If Person A’s chance of getting a disease is 1 in 100 in 10 years, and person B’s risk is double that of A, what is B’s risk? | 55.3
Q14. Suppose that 1 out of every 10 000 doctors in a certain region is infected with the SARS virus; in the same region, 20 out of every 100 people in a particular at-risk population also are infected with the virus. A test for the virus gives a positive result in 99% of those who are infected and in 1% of those who are not infected. A randomly selected doctor and a randomly selected person in the at-risk population in this region both test positive for the disease. Who is more likely to actually have the disease? | 52.8
Q3. In the ACME PUBLISHING SWEEPSTAKES, the chance of winning a car is 1 in 1000. What percent of tickets of ACME PUBLISHING SWEEPSTAKES win a car? | 34.5
Q16 (CRT). If it takes five machines 5 minutes to make five widgets, how long would it take 100 machines to make 100 widgets? | 32.3
Q17 (CRT). In a lake, there is a patch of lily pads. Every day, the patch doubles in size. If it takes 48 days for the patch to cover the entire lake, how long would it take for the patch to cover half of the lake? | 31.9
Q15 (CRT). A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost? | 18.7
Q12. Suppose you have a close friend who has a lump in her breast and must have a mammography ... The table below summarizes all of this information. Imagine that your friend tests positive (as if she had a tumor), what is the likelihood that she actually has a tumor? | 9.8

Table 3. Difficulty structure and fit statistics for the eight-item numeracy scale, Study 1

Item | Difficulty | Infit | Outfit
Q12 | 89.0 | 1.10 | .90
CRT1 | 73.5 | .95 | .72
CRT3 | 60.2 | .87 | .75
Q3 | 57.9 | .84 | .76
Q2 | 39.6 | 1.24 | 1.61
Q1 | 29.8 | .90 | .77
Q9 | 26.2 | 1.02 | 1.16
Q8b | 15.2 | 1.05 | .79

Descriptive statistics. Figure 1 shows frequency distributions for the separate measures used: the Lipkus et al. measure (Panel A), Frederick’s CRT (Panel B), the Peters et al. measure (Panel C), and the Rasch-modeled scale (Panel D). Table 4 shows the descriptive statistics for each scale. As expected, the CRT was positively skewed, whereas the Peters et al. measure and especially the Lipkus et al. measure were negatively skewed. These findings suggest that both the Lipkus et al. measure and the CRT do not adhere to a normal distribution. On the contrary, performance scores for the Rasch-based numeracy scale were roughly normally distributed (M = 4.12, SD = 1.87, median = 4, mode = 4), and the distribution was not significantly skewed (.07, z = 0.11, ns). Taken together, these results strongly suggest that the CRT, Lipkus et al., and Peters et al. scales, taken separately, may be too difficult or too easy, which may limit the sensitivity of the test to accurately detect an individual’s true ability level on the latent construct.

Associations between Rasch-based numeracy scale and demographic variables. Somewhat surprisingly, we found no significant negative correlation between age and numeracy (r=.02, ns). With respect to gender, we found that men performed better than women (point biserial r = .28, p < .001). We also investigated how educational level was associated with numeracy performance. As shown in Table 5, we observed that a disproportionate number of individuals with a high school/trade school or less educational level (low education group) scored 0 on the CRT (64%). In fact, even among those with a bachelor’s degree or greater (high-education group), the modal response was still 0. In contrast, we observed that the Lipkus et al. measure showed a greater negative skew as a function of participants’ educational level. Nearly 69% of all individuals scored 9 or higher on the Lipkus et al. measure. The Rasch-based measure, in comparison, maintained a relatively normal distribution across different educational levels. For this scale, the majority of respondents scored in the middle of the distribution, with predictably more individuals in the lower-education group scoring worse on the scale, whereas in the higher-education group, more individuals scored towards the higher end of the distribution. To further examine these educational level differences with the Rasch-based numeracy measure, we conducted a one-way analysis of variance for educational level (three levels: high school/trade school education or less, some college, and 4-year college graduate

Figure 1. Frequency distributions of individual scales, Study 1. We present the frequency distributions of the cognitive reflection task (CRT; Panel a), the Lipkus et al. numeracy measure (Lipkus; Panel b), the Peters et al. numeracy measure (Peters; Panel c), and the new reduced Rasch-derived model developed in the current study (“Rasch”; Panel d).

Table 4. Descriptive statistics for numeracy measures, Study 1

Table 5.
Distribution of correct answers for the CRT, Schwartz et al., Lipkus et al., and Rasch-based measures as a function of Scale Mean (SD) Median Mode Skewness educational level—Study 1 CRT (three items) 0.83 (.99) 0 0 .88 Educational level Schwartz et al. 1.8 (1.01) 2 2 –.36 Scale (three items) score High school/trade Some college College grad Lipkus et al. (11 items) 8.15 (2.36) 9 10 –.94 Cognitive reflection test Peters et al. (15 items) 10.48 (2.81) 11 12 –.98 0 64.1 55.8 36.1 Rasch-based 4.13 (1.87) 4 4 .00 1 22.3 24.8 25.5 (eight items) 2 9.5 13.7 23.1 3 4.2 5.8 15.3 Schwartz et al. 0 22.6 12.4 4.1 1 32.3 27.7 17.1 or greater). As expected, we found a significant main effect 2 29.7 36.5 35.9 for educational level (F(2, 1965) = 169.20, p< .001). Those 3 15.4 23.5 43.0 holding a college degree or greater performed best on the Lipkus et al. Rasch-based numeracy measure (M= 4.90, compared with 0–4 16.9 7.7 2.8 4.02 and 3.06 for the some college and high school/trade 5–8 53.0 47.4 28.4 9–11 37.7 44.9 68.8 school or less education groups, respectively). Peters et al. 0–4 6.5 3.1 0.8 5–8 35.9 21.2 7.7 9–12 46.0 57.3 51.0 Convergent validity 13–15 11.6 18.3 40.4 Participants from the community sub-sample also completed the Rasch-based 0–2 37.4 20.8 8.5 Fagerlin et al. (2007) eight-item SNS (a = .86). As expected, we 3–5 50.8 61.2 53.1 found that the Rasch-based numeracy measure was significantly 6–8 11.8 17.9 38.4 correlated with individuals’ subjective perceptions of numeracy Copyright © 2012 John Wiley & Sons, Ltd. J. Behav. Dec. Making, 26: 198–212 (2013) DOI: 10.1002/bdm J. A. Weller et al. Rasch-Based Numeracy Scale 205 (r=.55, p< .001). This correlation did not differ from the education, 26% had completed high school or a trade school, Lipkus et al. measure (r = .55) or the Peters et al. 15-item mea- 57% had completed some college or had a college degree, sure (r = .57). It was stronger than both the Schwartz et al. 
and 14% had completed schooling beyond a 4-year degree. three-item measure (r = .44) and the CRT (r = .43). The Decision Research Web panel participants are compen- Taken together, these results indicate that the Rasch-based sated $15 per hour (prorated). measure was able to reduce the item pool from 18 to eight items, while maintaining the psychometric qualities of the larger item pool and the composite scales. Additionally, we Decision-making tasks found evidence of convergent validity and largely replicated Ratio bias task. As explained earlier, in the ratio bias task previously reported correlations with demographic variables. (Denes-Raj & Epstein, 1994), participants are offered a chance to win a prize by drawing a red jellybean from one of two bowls. One bowl has a greater absolute number of STUDY 2 red beans (i.e., 9 in 100), and the other bowl has a smaller absolute number but a greater proportion of red beans (i.e., Overview 1 in 10). Peters et al. (2006) predicted and found that less The purpose of Study 2 was both to confirm the Rasch results numerate adults drew more often from the affectively appeal- from Study 1 on an independent sample and to test the ing bowl with less favorable objective probabilities whereas predictive validity of the eight-item Rasch-based numeracy the highly numerate drew more often from the objectively scale. We tested performance on three decision-making para- better bowl. Participants responded on a 13-point bipolar digms that previously have been associated with individual scale (1 = strongly prefer 9% bowl,7= no preference, differences in numeracy (Peters et al., 2006). Specifically, 13 = strongly prefer 10% bowl). We predicted that the new we tested whether performance on the Rasch-based numer- measure would also replicate the findings of Peters et al. acy measure predicted the following: (i) the extent of framing effects; (ii) how individuals rated the attractiveness of bets in “Bets” task. Peters et al. 
(2006) concluded from the ratio bias a “less is more” effect paradigm (Slovic, Finucane, Peters, & task discussed earlier that an affective process may underlie MacGregor, 2002); and (iii) the extent of denominator the greater number use of numbers by the highly numerate. neglect in a ratio bias task. We also compared the predictive If correct, then highly numerate individuals (who are thought validity of the Rasch-based scale with that of two of the to be more likely to draw affective meaning from number component measures, namely the CRT and the Lipkus et al. comparisons) may sometimes overuse numbers and respond measures. One well-established criteria of successful short- less rationally than the less numerate. As a replication of form development is that an abbreviated measure should the work of Peters et al. (2006), the bets task was conducted not result in significant decrements to validity (Smith et al., in a between-subjects design. One group of participants rated 2000). By definition, short-form development attempts to the attractiveness of a no-loss gamble (7/36 chances to win reduce a construct that prior researchers concluded required $9; otherwise, win $0) on a 0–20 scale; a second group rated a more lengthy assessment. If a full-length scale contains a similar gamble with a small loss (7/36 chances to win $9; much irrelevant or invalid content, then one could expect that otherwise lose 5¢). Peters et al. (2006) hypothesized and the validity of the short-form scale would increase. However, found that highly numerate participants rated the objectively if the items contained in full-length assessment are largely worse bet as more attractive and reported more precise affect valid, then one would expect that a short-form measure and more positive affect to the $9 in the loss’ presence. Thus, would result in reduced validity. 
In this sense, the Rasch- although greater numeracy is generally thought to lead to based measure does not necessarily have to demonstrate better decisions when numeric information is involved, it increased validity compared with other assessments but appears associated sometimes with an overuse of number should, at least, show comparable validity with that observed comparisons, which may subsequently lead to sub-optimal with the other measures. judgments despite higher ability levels. These results were consistent with the highly numerate accessing a richer affective “gist” from numbers (Reyna et al., 2009). Method Thus, we predicted a significant bet condition numeracy Participants interaction. The sample consisted of 899 participants who consented to be part of an ongoing opt-in Web panel administered by Decision Research. The panel members are 65% women Framing. Participants were presented with the exam scores and have a mean age of 38.7 years. Two percent had less than and course levels (200, 300, or 400—indicating varying a high school education, 33% had completed high school or a difficulty levels of classes) of three psychology students trade school, 53% had completed some college or had a and were asked to rate the quality of each student’s work college degree, and 13% had completed schooling beyond on a 7-point scale (3= very poor to +3 = very good). Fram- a 4-year degree. A subset of this sample (n = 723, 70% ing was manipulated between subjects as percent correct or women) was used for testing the predictive validity of the percent incorrect so that “Paul,” for example, was described Rasch-based measure. The mean age of the sample was as receiving either 74% correct on his exam or 26% incor- 39.5 years. One percent had less than a high school rect. Consistent with prior research (Peters et al., 2006), we Copyright © 2012 John Wiley & Sons, Ltd. J. Behav. Dec. 
predicted the difference in ratings for the positive versus negative frames would be greatest among less numerate participants. Put differently, participants lower in numeracy were expected to show more pronounced framing effects than those higher in numeracy. More numerate individuals were expected to transform the provided frame into the normatively equivalent alternative frame so that they would have both frames of information available (Cokely & Kelley, 2009; Peters et al., 2006).

All participants completed the framing and bets decision tasks. A subset (n = 218) also completed the ratio bias task.

Results and discussion
Rasch analysis
Rasch analysis was conducted in the same manner as in Study 1. We found that the results matched those obtained in Study 1, both in terms of the items retained as well as their relative difficulties (Table 6). Person reliability for this scale was .65, and Cronbach’s α was .71; the mean inter-item correlation for the retained items was .24.

Table 6. Difficulty structure and fit statistics for the Rasch-based numeracy scale (Study 2)
Item    Difficulty    Infit    Outfit
Q12     90.5          1.24     .74
CRT1    76.6          .96      .94
CRT3    60.5          .91      .67
Q3      54.2          .84      .80
Q2      30.5          .96      .84
Q1      29.7          1.07     1.27
Q9      17.4          .91      .62
Q8b     14.2          1.07     .98
Note. Higher difficulty scores indicate greater difficulty.

Descriptive statistics
As predicted, the scores for the Rasch-based numeracy scale were roughly normally distributed (M = 4.07, SD = 1.83, median = 4, mode = 4), and the distribution was not significantly skewed (skewness = .07, z = 0.83).

Associations between Rasch-based numeracy and demographic variables. We found the expected negative correlation between age and numeracy (r = −.17, p < .001). Additionally, we found that men performed better than women (point biserial r = .31, p < .001). Moreover, we conducted a one-way analysis of variance to determine differences in numeracy as a function of educational level (three levels: high school/trade school education or less, some college, and 4-year college graduate or greater). As expected, we found a significant main effect for educational level (F(2, 720) = 35.57, p < .001), in that those with a 4-year college degree or greater performed better on the Rasch-based numeracy measure (M = 4.60, compared with 4.11 and 3.27 for the some college and high school/trade school or less education groups, respectively). Overall, these findings replicate the results reported in Study 1.

Predictive validity
We report the analyses based on the Rasch-derived measure in the following sections and then discuss the issue of comparative validity.

Ratio bias task. We also replicated the findings from the ratio bias task (Peters et al., 2006). Consistent with Peters et al. (2006), more numerate participants had a stronger preference for the objectively better bowl (10% bowl) than those lower in numeracy (r(218) = 0.16, p < .01). This result also is consistent with Stanovich and West’s (2008) finding that cognitive ability was significantly associated with a similar ratio bias problem.

Bets task. We regressed the rated attractiveness of the gamble on condition (coded −1 = no loss, 1 = small loss), the individual differences in numeracy (mean deviated), and the interaction between numeracy level and condition. Consistent with prior research (Bateman, Dent, Peters, Slovic, & Starmer, 2007; Slovic et al., 2002; Stanovich & West, 2008), participants rated the gamble as more attractive in the small loss condition (F(1, 719) = 60.40, p < .001, b = .92). Participants higher in numeracy also rated the gamble to be more attractive overall than those lower in numeracy (F(1, 719) = 16.11, p < .001, b = .26). Replicating Peters et al. (2006), the hypothesized interaction was also significant, such that participants higher in numeracy were more strongly affected by the small loss in the task (F(1, 719) = 6.50, p < .01, b = .17).

Framing task. We regressed the average rated student’s work quality on frame condition (coded −1 = negative, 1 = positive), numeracy (mean deviated), and a frame × numeracy interaction. Subjects who did not respond to all stimuli (n = 29) were excluded from the analyses. As expected, we replicated the findings from the framing task reported earlier (Peters et al., 2006). We found a significant effect for frame (F(1, 690) = 245.07, p < .001, b = .48) and additionally found a significant main effect for numeracy (F(1, 690) = 4.59, p < .05, b = .04). Most importantly, we found a significant frame × numeracy interaction (F(1, 690) = 8.34, p < .001, b = .05), in which less numerate participants showed larger framing effects. These findings replicate the work of Peters et al. (2006) and, moreover, are consistent with research suggesting that less numerate decision makers focus on non-numeric sources of information when constructing preferences (Dieckmann, Slovic, & Peters, 2009; Peters, Dieckmann, Västfjäll, Mertz, Slovic, & Hibbard, 2009).

Comparative validity
Table 7 shows the results for the three behavioral tasks as a function of the numeracy assessment used. Overall, the Rasch-based scale demonstrates comparable validity with that observed with the Lipkus et al. and CRT scales. For the ratio bias task, the Rasch-based measure was more strongly associated with preference for the normatively correct bowl than the CRT; associations of the Rasch-based scale and the longer Lipkus et al. scale were about the same. For the bets task, we found that the numeracy × bet condition interaction was significant using all three numeracy measures. To test the extent to which this effect was stronger for the Rasch-based numeracy measure, we calculated and compared the effect size estimates for the differences between bet conditions (i.e., bets effect) as a function of both numeracy level (i.e., either high or low numeracy) and specific numeracy measure. Essentially, these analyses compare the simple effects of the interaction in terms of a linear contrast for numeracy, as construed by the different measures. For the Rasch-based measure, the effect size of the bets effect for those scoring highest in numeracy (seven to eight items correct; d = 1.06) was nearly four times as large as the effect size observed for those scoring lowest on the Rasch-based numeracy measure (zero to two items correct; d = .27). Similarly, those who scored 0 on the CRT showed weaker effect sizes (d = .42) than those who answered all three CRT items correctly (d = .70). We also observed a stronger effect size for those scoring highest on the Lipkus et al. measure (9–11 items correct, d = .65) than for individuals scoring the lowest on numeracy (zero to four items correct, d = .06). Thus, although we found the significant predicted interaction effect for all three scales, these results suggest that these effects were strongest when assessed with the Rasch-based numeracy scale.

Table 7. Comparative validity analyses regressing decision performance on numeracy scales (Study 2)
                Ratio bias task    Bets task                                                      Framing task
Scale           Pearson r          Bets condition    Numeracy scale    Interaction    R²          Framing condition    Numeracy scale    Interaction    R²
CRT             .11                0.91**            0.57**            0.24*          .11         0.47**               0.03              0.1**          .27
Lipkus et al.   .14                0.92**            0.15              0.11*          .09         0.47**               0.03*             0.02           .26
Rasch-based     .16                0.92**            0.26**            0.16**         .10         0.48**               0.04*             0.05**         .27
Note. CRT, cognitive reflection test. *p < .05, **p < .01. Each row reflects a separate regression analysis. Unstandardized coefficients and effect sizes are shown for each independent variable.

For the framing task, we observed interaction effects for both the CRT and Rasch-based measures, but not for the Lipkus et al. measure. To explore these interaction effects in greater depth, we again calculated and compared effect size estimates of framing effects for high and low scorers on the CRT and Rasch-based measures. Individuals who scored lowest on the Rasch-based measure showed very strong framing effects (d = 1.42), even more so than those scoring 0 on the CRT (d = 1.33). In contrast, we found that individuals scoring highest on the CRT showed about the same framing effects (d = .67) as did those scoring the highest on the Rasch-based numeracy scale (d = .65). Thus, compared with results of the CRT, these results provide evidence that using the Rasch-based measure showed a slight advantage over the CRT when predicting framing effects for the less numerate, which was in the predicted direction of the interaction.

Together, these results provide evidence that the Rasch-based numeracy scale shows comparable validity with both the Lipkus et al. measure and the CRT. The Rasch-based measure showed better distributional qualities than the CRT or the Lipkus et al. measure and also demonstrated some evidence for stronger predictive validity than these existing measures. However, we acknowledge that this evidence is somewhat mixed. Compared with the Lipkus et al. measure, we found the Rasch-based measure to show stronger simple effects when we decomposed the framing and bets task interactions, but it showed roughly equal predictive validity for the ratio bias task. Compared with the CRT, the Rasch-based measure showed stronger effects with respect to the bets task and the ratio bias task but only showed modest effect size differences for the framing task. It is possible that the use of a more general population, not to mention one collected over the Internet, dampened expected relationships between numeracy and decision effects, thus reducing the chances of finding stronger scale-based differences. First, the materials in the framing task were originally developed to be meaningful to the undergraduate population tested by Peters et al. (2006), but the course level information (which was provided without further explanation) may have been confusing for the more general population studied here. Second, although Internet data collection is a valid means of obtaining psychological data, data from Internet samples are often noisier because of the lack of environmental control (Gosling, Vazire, Srivastava, & John, 2004).

Although these results are encouraging, they may raise a potential question regarding the advantages of the Rasch-based numeracy scale. As we have demonstrated in the past two studies, the primary advantage of the Rasch-based scale is that it offers a normal distribution in the general population, compared with the Lipkus et al. measure and the CRT, both of which are significantly skewed. Because skewness can attenuate linear associations between variables, we predicted that the Rasch-based scale would be a stronger linear predictor than either of the component scales. Our Study 2 results suggest that this will not always be the case. In Study 3, we examine this issue further within the context of risk perception.

STUDY 3

In this study, we wanted to further explore the comparative predictive validity of the Rasch-based scale using two additional tasks. We turned our attention to understanding how numeracy may predict perceived risks. Recent work has demonstrated that numeracy is related to likelihood and risk perceptions. For instance, when presented with numerical probability information, less numerate participants tend to think that negative low-probability events are more likely to occur, compared with more numerate participants (e.g., Dieckmann et al., 2009; also Lipkus, Peters, Kimmick, Liotcheva, & Marcom, 2010). This typical finding may be due to the less numerate responding more to non-numeric and often emotional information about risks such as cancer (Peters, 2012; Reyna et al., 2009).
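The premise carried over from Study 2, that a skewed score distribution can attenuate linear associations, can be illustrated with a small simulation. The sketch below is not the authors' analysis and uses no data from the paper. It assumes a standard-normal latent ability, a Rasch response model, and a hypothetical outcome equal to ability plus noise. The two eight-item tests and their logit difficulties (`targeted`, `too_hard`) are invented for illustration and are not the estimates reported in Tables 3 or 6.

```python
import math
import random
import statistics


def rasch_prob(theta, difficulty):
    """Rasch model: probability of a correct response given ability theta."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def simulate_sum_scores(thetas, difficulties, rng):
    """Simulate one sum score (number correct) per respondent."""
    return [
        sum(rng.random() < rasch_prob(theta, b) for b in difficulties)
        for theta in thetas
    ]


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def skewness(x):
    """Moment-based sample skewness (third standardized moment)."""
    m = statistics.fmean(x)
    m2 = sum((a - m) ** 2 for a in x) / len(x)
    m3 = sum((a - m) ** 3 for a in x) / len(x)
    return m3 / m2 ** 1.5


rng = random.Random(2012)  # fixed seed so the illustration is repeatable
n = 5000

# Assumption: latent numeracy is standard normal in the population.
thetas = [rng.gauss(0, 1) for _ in range(n)]
# Hypothetical outcome: linear in ability plus independent noise.
outcome = [theta + rng.gauss(0, 1) for theta in thetas]

# Two hypothetical eight-item tests (difficulties in logits).
targeted = [-1.5, -1.0, -0.5, 0.0, 0.0, 0.5, 1.0, 1.5]  # spread around the mean
too_hard = [2.5] * 8                                     # far above the mean

targeted_scores = simulate_sum_scores(thetas, targeted, rng)
hard_scores = simulate_sum_scores(thetas, too_hard, rng)

r_targeted = pearson_r(targeted_scores, outcome)
r_hard = pearson_r(hard_scores, outcome)
skew_targeted = skewness(targeted_scores)
skew_hard = skewness(hard_scores)

print(f"targeted test: skewness = {skew_targeted:.2f}, r with outcome = {r_targeted:.2f}")
print(f"too-hard test: skewness = {skew_hard:.2f}, r with outcome = {r_hard:.2f}")
```

Both simulated tests have eight items, so the only difference between them is how well the item difficulties match the ability distribution. Under these assumptions the poorly targeted test yields positively skewed scores and a weaker correlation with the outcome, which is the pattern the attenuation argument anticipates.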
For this study, we examined whether the Rasch-based measure would be a stronger linear predictor for outcomes related to the explicit understanding and use of probabilistic estimation than is afforded by either the CRT or the Lipkus et al. 11-item measure. The association between understanding risk information and numeracy appears to be a very robust phenomenon (see Reyna et al., 2009, for a review). Understanding how numeracy is associated with risk perceptions is important in many domains, including financial and health decisions. For instance, if individuals lower in numeracy misinterpret the risks of treatment options, they may act in a suboptimal way. Similarly, being able to accurately identify true numeracy abilities may enable risk communicators to develop more customized and effective communication messages.

Method
Participants
The sample (N = 165) was drawn from the Decision Research Web Panel and was 57.6% women (mean age = 39.53 years). Approximately 25% of the sample had a high school education or less, 4% had some vocational training, 28% had attended some college, 33% were college graduates, and 10% had attended graduate or professional school after college.

Procedure
In a previous session, participants completed the CRT, the Lipkus et al. numeracy measure, and the additional items from the Peters et al. measure. Participants each read two different scenarios that included a narrative discussion of available evidence relating to an event as well as a numerical probability assessment made by an expert. The first scenario described a potential terrorist attack, and the second scenario described the possible extinction of salmon in a Pacific Northwest river. The likelihood of each event was presented as either 5% or 20%. Each participant read both scenarios (their order was counterbalanced across subjects), and the numerical probability attached to each scenario was counterbalanced separately across subjects (i.e., numerical probability was a within-subject manipulation). After reading each scenario, participants reported their own perceptions of the likelihood of the attack or salmon extinction on a scale ranging from 0% to 100%. Because the goal of this analysis was to examine the associations between the different numeracy scales and likelihood perceptions in the two scenarios, we do not report the effect of the within-subject condition but instead focus on the correlational analyses for this study.

Results and discussion
Table 8 shows the correlations between perceived likelihood and the three different numeracy scales for the full sample, and for the lower-education (vocational school or less) and higher-education (some college or more) groups. For both scenarios, the full-sample correlations were higher with the Rasch-based measure, although each of the numeracy scales is significantly negatively correlated with perceived likelihood, as expected. However, we anticipated the primary benefit of the Rasch-based measure to be in identifying linear effects across a range of educational levels. In particular, given the difficulty of the CRT, we expected attenuated correlations in the lower-education group. As predicted, the results demonstrate that the CRT showed the smallest correlations across both scenarios, with the Lipkus et al. and Rasch-based measures showing comparable effect sizes. In the higher-education group, all of the numeracy scales were inversely correlated with risk perceptions; the Rasch-based measure shows the largest effect size for both scenarios.

As expected, the Rasch-based measure showed the strongest and most consistent effects in the full sample and across the two education groups. The CRT consistently demonstrated low correlations in the lower-education group. Moreover, both the Lipkus et al. measure and the CRT showed lower correlations with risk perceptions than did the Rasch-based measure in the higher-education group.

Study 3 demonstrates some distinct advantages of the new Rasch-based measure. First, the Rasch-based measure demonstrates the most consistent level of correlations across various educational levels. We attribute this advantage to the fact that performance on the Rasch-based measure is normally distributed in the general population. Second, and perhaps more importantly, the Rasch-based measure overall shows stronger predictive validity in these judgments and decisions, compared with the other two measures. Compared with the Rasch-based measure, the CRT showed limited predictive validity, especially in the lower-education sample. In contrast, the Lipkus et al. measure showed evidence of reduced predictive validity in higher-education samples.

Table 8. Correlations between risk perceptions and numeracy in the full sample and as a function of educational level (Study 3)
                               Full sample    Lower-education group    Higher-education group
Terrorist attacks
  CRT (three items)            –.24**         –.13*                    –.21*
  Lipkus et al. (11 items)     –.34**         –.34*                    –.29**
  Rasch (eight items)          –.41**         –.38**                   –.36**
Salmon extinction
  CRT (three items)            –.35**         –.11                     –.33**
  Lipkus et al. (11 items)     –.38**         –.31*                    –.35**
  Rasch-based (eight items)    –.44**         –.27                     –.43**
Note. CRT, cognitive reflection test. p < .10, *p < .05, **p < .01.

GENERAL DISCUSSION

A growing body of research has demonstrated that individual differences in numeracy are associated with how individuals perceive risks, understand charts and graphs, and ultimately make decisions. However, measurement of this construct has varied. To our knowledge, this study is the first to present the psychometric properties of several popular numeracy measures across a diverse sample in terms of age and educational level (although see Liberali et al., 2011 for a similar examination with Brazilian and US college-age samples, which adds to the literature from a cross-cultural perspective). Inspection of the distributional characteristics of these measures demonstrates that the previously used measures are very skewed, which may limit their ability to discriminate an individual’s trait level of numeracy. In general, the CRT appears to be very difficult, whereas the Lipkus et al. (2001) measure appears to be too easy for most individuals, leading to non-normal score distributions, an issue that prior research has largely addressed by using median splits or extreme group designs. We do not mean to either diminish or criticize the contributions that have been made using these scales. In fact, these studies reinforce past research efforts supporting and strengthening the validity of extant measures.

In the current study, we used Rasch analysis to develop a scale that offers researchers an alternative means to assess individual differences in numeracy, compared with classical test theory approaches (Embretson, 1996). The items retrieved, as well as the relative difficulty scaling of these items, were identical across two large independent samples of individuals ranging from 18 to 89 years of age. Moreover, the Rasch-based numeracy scale retained a wide range of item difficulties. Further, we found that this scale approached a normal distribution in both samples, which we believe will ultimately lead researchers to treat numeracy as a continuous variable rather than as a dichotomous variable. We feel that this is an important contribution, given the potential limitations involved with dichotomizing variables (MacCallum, Zhang, Preacher, & Rucker, 2002).

Cronbach and Meehl’s (1955) classic article first identified construct validity (i.e., how trustworthy is the score and its interpretation) as the most important form of validity in psychological tests. Construct validity of a measure should be treated as a continual process that involves researchers testing the predictive validity of the measure, as well as assessing convergent and discriminant validity. The Rasch-based measure demonstrates predictive validity comparable with that obtained in previous numeracy studies. In fact, when directly comparing the Rasch-based scale with its predecessors, we found that the Rasch-based measure predicted as well as or better than the CRT and the Lipkus et al. measure across two separate studies.

We also found that the Rasch-based numeracy measure was strongly correlated with the SNS of Fagerlin et al. (2007), supporting the convergent validity of the measure. Although the SNS was not intended to be a substitute for assessing precise numeracy abilities, this finding reinforces prior research supporting a link between subjective and objective assessments of numeracy (Fagerlin et al., 2007; Zikmund-Fisher et al., 2007; although see Reyna et al., 2009, p. 955, for an excellent discussion regarding concerns about the accuracy of individuals’ subjective assessments of their own numeracy). Because the SNS was administered after the objective numeracy measures, we cannot rule out the possibility that individuals reflected on the perceived ease/difficulty of the numeracy items, which, in turn, may have inflated the correlation between numeracy and SNS. However, our results are consistent with those reported by Fagerlin et al. (2007), who had subjects complete the SNS first. Finally, our data cannot directly speak to any differences in predictive power between objective and subjective numeracy scales, but we believe that this is an important question that future research should address.

We acknowledge that this scale may not include a complete range of difficulty. Because of our study’s design, our results are limited by the number of items that were included in the initial item pool. In fact, examination of the Rasch-based item difficulties would suggest that more items could be added to more finely differentiate individuals’ numeracy ability. Cokely et al. (2012), for instance, applied a decision tree approach to develop a computer-adaptive test for the highly numerate. Future research using IRT principles can help to create adaptive tests that may assess numeracy across a wider range of ability levels.

Another implication of only using existing measures is that it restricts our ability to conduct a more extensive analysis of potential multidimensionality of the numeracy construct (Liberali et al., 2011). If we had started with a much larger initial item pool, it might be reasonable to expect multiple correlated facets of numeracy to be extracted that would represent sub-competencies of numeracy. Although previous research has typically added items on the basis of their face validity, we recommend that future scale construction efforts be based instead on accepted scale construction guidelines widely used in the assessment literature (e.g., Clark & Watson, 1995). This process begins with the generation of an item pool based on theoretical considerations, such as those discussed in literature reviews and empirical inquiries (see Dehaene, 1997, and Reyna et al., 2009, for influential reviews). Briefly, researchers should develop an over-inclusive item pool of various items and difficulty levels. Numeracy skills range from, but are not limited to, simple mathematical operations (e.g., addition, multiplication) to logic and quantitative reasoning, as well as comprehension of probabilities, proportions, and fractions. From this item pool, researchers would subsequently conduct multiple administrations of the items, refining the measure by removing ambiguous/poorly constructed and misfit items along the way. Scale development in this manner can result in the ability to make more fine-grained distinctions in numeracy across persons and to more extensively identify sub-competencies/facets of numeracy. From there, researchers will be able to better test if certain sub-competencies of numeracy are differentially important to particular types of judgment and decision problems. Understanding the multiple potential facets of numeracy is an important and necessary future research direction that would be most properly examined within the context of the scale construction/factor analytic methods that we have outlined.

However, we offer one important caveat with respect to the assessment of multidimensionality. As a consequence of adequately developing measures that assess numeracy sub-competencies in the manner that we have outlined, this method would add many more items to a numeracy scale. It would especially be the case if one

from 18 to eight items, creating a measure that is comparable, or even better, in terms of predictive validity and internal consistency with that which would have been obtained by administering either all 18 items or one of the component scales.

As the study of numeracy in the decision-making literature continues to grow, the importance of being able to appropriately discriminate individual differences in numeracy also increases.
The current study offers a measure that wanted to adequately scale item difficulty and ability researchers interested in the associations between numeracy levels for each sub-competency. At the expense of being and human decision processes can use to assess individual more comprehensive, it would undoubtedly add more differences across a wider range of target populations time to assessments than even the longest numeracy mea- compared with previous measures. sure that currently exists. Thus, researchers who may have limited assessment time or resources available (e.g., researchers interested in assessing numeracy in large ACKNOWLEDGEMENTS nationally representative surveys) may opt for a shorter instrument, sacrificing construct fidelity for a broader The authors would like to gratefully acknowledge support bandwidth. We stress that it is vital for researchers to from the National Science Foundation, grant numbers have both types of measures in their assessment arsenal; SES-0820197 and SES-0517770 to Dr. Peters, SES- ultimately, though, the use of each is dependent on the 0901036 to Dr. Burns, SES-0925008 to Dr. Dieckmann, inquiry at hand. and SES-082058 to Dr. Weller. Data collection for Study We believe that our Rasch-based measure provides a 2 was supported by the National Institute on Aging, grant valuable advance in the assessment of numeracy. Our numbers R01AG20717 and P30AG024962. All views results reinforce that our reduced-item scale measures expressed in this paper are those of the authors alone. numeracy in a coherent, unitary manner, across a wide range of ability levels. Of particular interest, we used CFA to directly test whether the CRT and the numeracy REFERENCES items comprised different underlying factors. We did not find this to be the case. At the surface, these results Bateman, I. A., Dent, S., Peters, E., Slovic, P., & Starmer, C. appear to be in contrast with those reported by Liberali (2007). 
The affect heuristic and attractiveness of simple et al. (2011), who, across two samples, concluded that gambles. Journal of Behavioral Decision Making, 20,365–380. items from the scales of Lipkus et al. (2001) and Frederick DOI: 10.1002/bdm.558 (2005) produced four to five factors based on exploratory Briggs, S. R., & Cheek, J. M. (1986). The role of factor analysis in the development and evaluation of personality scales. Journal of factor analysis. Moreover, in one of their two studies, the Personality, 54, 106–148. CRT and objective numeracy items loaded onto different Burkell, J. (2004). What are the chances? Evaluating risk and factors. Because the single-factor un-rotated solutions, a direct benefit information in consumer health materials. Journal of measure of the common construct defined by the item pool, the Medical Library Association, 92, 200–208. were not reported, we cannot directly compare results of the Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in scale development. Psychological Assessment, 7, current study with those of Liberali et al. (2011). However, 309–319. given that reported correlations between the CRT and the Cohen, J. (1992). A power primer. Psychological Bulletin, 112, Lipkus et al. numeracy measure by Liberali et al. (2011) were 155–159. DOI: 10.1037/0033-2909.112.1.155 indicative of a moderate to large effect size (range = .40–.51; Cokely, E. T., Galesic, M., Schulz, E., Ghazal, S., & Garcia-Retamero, R. mean r = .45), it seems reasonable that a one-factor solution (2012). Measuring risk literacy: The Berlin Numeracy Test. Judgment and Decision Making, 7,25–47. may also have been observed in confirmatory factor analyses Cokely, E. T., & Kelley, C. M. (2009). Cognitive abilities and supe- of their data as well. rior decision making under risk: A protocol analysis and process In contrast to exploratory factor analysis as a data model evaluation. Judgment and Decision Making, 4,20–33. 
reduction tool, the Rasch analysis identifies a hypothetical Cole, J. C., Kaufman, A. S., Smith, T. L., & Rabin, A. S. (2004). unidimensional line on which items and persons are scaled Development and validation of a Rasch-derived CES-D short form. Psychological Assessment, 16, 360–372. on the basis of item difficulty and ability level. In turn, Cortina, J. M. (1993). What is coefficient alpha? An examination misfit items represent items that do not contribute to better of theory and applications. Journal of Applied Psychology, 78, identification of the construct. Hence, the reduced scale 98–104. requires fewer items to estimate the latent construct with Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in the same range of ability level as the full item pool. In our psychological tests. Psychological Bulletin, 52, 281–302. Dehaene, S. (1997). The number sense: How the mind creates study, we were able to substantially reduce an item pool mathematics. New York: Oxford University Press. Denes-Raj, V., & Epstein, S. (1994). Conflict between intuitive and rational processing: When people behave against their better judgment. Journal of Personality and Social Psychology, 66, Note that the Kaiser rule has the potential to overestimate the number of dimensions to retain (Zwick & Valicer, 1986). 819–829. Copyright © 2012 John Wiley & Sons, Ltd. J. Behav. Dec. Making, 26: 198–212 (2013) DOI: 10.1002/bdm J. A. Weller et al. Rasch-Based Numeracy Scale 211 Dieckmann, N. F., Slovic, P., & Peters, E. M. (2009). The use of Obrecht, N. A., Chapman, G. B., & Gelman, R. (2009). An encoun- narrative evidence and explicit likelihood by decision makers ter frequency account of how experience affects likelihood varying in numeracy. Risk Analysis, 29, 1473–1487. DOI: estimation. Memory & Cognition, 37, 632–643. 10.1111/j.1539-6924.2009.01279 Peters, E. (2012). Beyond comprehension: The role of numeracy in Dunning, D., Heath, C., & Suls, J. M. (2004). 
Flawed self-assessment: judgments and decisions. Current Directions in Psychological Implications for health, education, and the workplace. Psycholog- Science, 21,31–35. ical Science in the Public Interest, 5(3), 69–106. Peters, E., Dieckmann, N. F., Dixon, A., Hibbard, J. H., & Mertz, C. Educational Testing Service. (1992). National Adult Literacy Survey K. (2007). Less is more in presenting quality information to (NALS). Princeton, NJ: ETS. Retrieved from http://nces.ed.gov/ consumers. Medical Care Research and Review, 64, 169–190. pubsearch/pubsinfo.asp?pubid=199909 (14 August 2011). DOI: 10.1177/10775587070640020301 Embretson, S. E. (1996). The new rules of measurement. Psycho- Peters, E., Dieckmann, N. F., Västfjäll, D., Mertz, C. K., Slovic, P., & logical Assessment, 8, 341–349. Hibbard, J. H. (2009). Bringing meaning to numbers: The impact Estrada, C., Barnes, V., Collins, C., & Byrd, J. C. (1999). Health of evaluative categories on decisions. Journal of Experimental literacy and numeracy. Journal of the American Medical Associ- Psychology. Applied, 15,213–227. DOI: 10.1037/a0016978 ation, 282, 527. Peters, E., Hibbard, J. H., Slovic, P., & Dieckmann, N. F. Fagerlin, A., Zikmund-Fisher, B., Ubel, J., Peter, A., Jankovic, A., (2007). Numeracy skill and the communication, comprehen- Derry, H. A., & Smith, D. M. (2007). Measuring numeracy with- sion, and use of risk and benefit information. Health Affairs, out a math test: Development of the subjective numeracy scale. 26, 741–748. Medical Decision Making, 27, 672–680. DOI: 10.1177/ Peters, E., Västfjäll, D., Slovic, P., Mertz, C., Mazzocco, K., & 0272989X07304449 Dickert, S. (2006). Numeracy and decision making. Psycholog- Finucane, M. L., & Gullion, C. M. (2010). Developing a tool for ical Science, 17, 407–413. measuring the decision-making competence of older adults. Prieto, L., Alonso, J., & Lamarca, R. (2003). Classical test theory Psychology and Aging, 25, 271–288. 
DOI: 10.1037/a0019106 versus Rasch analysis for quality of life questionnaire reduction. Frederick, S. (2005). Cognitive reflection and decision making. Health and Quality of Life Outcomes, 1, 27. DOI: 1186/1477-7525 Journal of Economic Perspectives, 19,25–42. Rasch, G. (1993). Probabilistic models for some intelligence and Gosling, S. D., Vazire, S., Srivastava, S., & John, O. P. (2004). attainment tests. Chicago: Mesa Press (original work published Should we trust Web-based studies? A comparative analysis of in 1960). six preconceptions about Internet questionnaires. American Reyna, V. F., Nelson, W., Han, P., & Dieckmann, N. F. (2009). Psychologist, 59,93–104. How numeracy influences risk reduction and medical decision Hibbard, J. H., Mahoney, E. R., Stockard, J., & Tusler, M. (2005). making. Psychological Bulletin, 135, 943–973. Development and testing of a short form of the patient activa- Schwartz, L. M., Woloshin, S., Black, W. C., & Welch, H. G. (1997). tion measure. Health Research and Educational Trust, 40, The role of numeracy in understanding the benefit of screening 1918–1930. mammography. Annals of Internal Medicine, 127, 966–972. Hibbard, J. H., Slovic, P., Peters, E., Finucane, M. L., & Tusler, M. Simon, G. E., Ludman, E. J., Bauer, M. S., Unützer, J., & (2001). Is the informed-choice policy approach appropriate for Operskalski, B. (2006). Long-term effectiveness and cost of a Medicare beneficiaries? Health Affairs, 20, 199–203. systematic care program for bipolar disorder. Archives of Kahneman, D., & Tversky, A. (1979). Prospect theory: An analysis General Psychiatry, 63, 500–508. of decision under risk. Econometrica, 47, 263–291. Slovic, P., Finucane, M., Peters, E., & MacGregor, D. G. Kirsch, I. S., Jungeblut, A., Jenkins, L., & Kolstad, A. (2002). Adult (2002). The affect heuristic. In T. Gilovich, D. Griffin, & literacy in America: A first look at the findings of the National D. 
Kahneman (Eds.), Heuristics and biases: The psychology Adult Literacy Survey (3rd ed., Vol. 201). Washington, DC: of intuitive judgment (pp. 397–420). New York: Cambridge National Center for Education, US Department of Education. University Press. Kutner, M., Greenberg, E., Jin, Y., & Paulsen, C. (2006). The health Smith, P., & McCarthy, G. (1996). The development of a semi- literacy of America’s adults: Results from the 2003 National structured interview to investigate the attachment-related experi- Assessment of Adult Literacy (NCES 2006-483). Washington, ences of adults with learning disabilities. British Journal of DC: National Center for Education Statistics, US Department Learning Disabilities, 24, 154–160. of Education. Smith, G. T., McCarthy, D. M., & Anderson, K. G. (2000). On the Liberali, J. M., Reyna, V. F., Furlan, S., Stein, L. M., & Pardo, S. T. sins of short-form development. Psychological Assessment, 12, (2011). Individual differences in numeracy and cognitive reflec- 102–111. tion, with implications for biases and fallacies in probability Stanovich, K. E, & West, R. F. (2008). On the relative indepen- judgment. Journal of Behavioral Decision Making. DOI: dence of thinking biases and cognitive ability. Journal of 10.1002/bdm.752 Personality and Social Psychology, 94, 672–695. Linacre, J. M. (2002). What do infit and outfit, mean-square and Thaler, R. H., & Sunstein, C. R. (2003). Libertarian paternalism. standardized mean? Rasch Measurement Transactions, 16, 878. American Economic Review, 93, 174–179. Lipkus, I. M., Peters, E., Kimmick, G., Liotcheva, V., & Marcom, Toplak, M. E., West, R. F., & Stanovich, K. E. (2011). The Cogni- P. (2010). Breast cancer patients’ treatment expectations tive Reflection Test as a predictor of performance on heuristics- after exposure to the decision aid program Adjuvant Online: and-biases tasks. Memory and Cognition, 39, 1275–1289. DOI: The influence of numeracy. 
Medical Decision Making, 30, 10.3758/s13421-011-0104-1 464–473. Woloshin, S., Schwartz, L. M., & Welch, H. G. (2004). The Lipkus, I. M., Samsa, G., & Rimer, B. K. (2001). General perfor- value of benefit data in direct-to-consumer drug ads. mance on a numeracy scale among highly educated samples. Health Affairs, W4,234–245. DOI: 10.1377/hlthaff.W1374. Medical Decision Making, 21,37–44. 1234 Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental Zikmund-Fisher, B. J., Smith, D. M., Ubel, P. A., & Fagerlin, A. test scores. Reading, MA: Addison-Wesley. (2007). Validation of the Subjective Numeracy Scale (SNS): MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. Effects of low numeracy on comprehension of risk communications (2002). On the practice of dichotomization of quantitative and utility elicitations. Medical Decision Making, 27, 663–671. variables. Psychological Methods, 7,19–40. DOI: 10.1177/0272989X07303824 National Center for Education Statistics (NCES). (2003). National Zwick, W. R., & Velicer, W. F. (1986). Comparison of five rules for Assessment of Adult Literacy (NAAL). http://nces.ed.gov/naal/ determining the number of components to retain. Psychological (15 August 2011) Bulletin, 99, 432–442. Copyright © 2012 John Wiley & Sons, Ltd. J. Behav. Dec. Making, 26: 198–212 (2013) DOI: 10.1002/bdm 212 Journal of Behavioral Decision Making Authors’ biographies: William J. Burns is a research scientist at Decision Research (Eugene, OR), whose current work focuses on modeling public Joshua A. Weller is currently a research scientist at Decision response and the subsequent economic impacts of disasters (special Research (Eugene, OR). His research focuses on how the ability emphasis on terrorism) on urban areas. to make advantageous decisions develops throughout the life- Ellen Peters is an associate professor in the Psychology Depart- span. Additionally, Dr. Weller is interested in understanding ment at The Ohio State University. 
She studies decision making how individual differences relate to risk taking and decision as an interaction of characteristics of the decision situation and char- making. acteristics of the individual. Her research interests include decision Nathan F. Dieckmann is a research scientist at Decision Research making, affective and deliberative information processing, emotion, (Eugene, OR). He conducts basic and applied research in decision risk perception, numeracy, and aging. making, risk communication, and statistical methodology. Martin Tusler is a research specialist in the Psychology Depart- Authors’ addresses: ment at The Ohio State University. He studies medical decision making, scale construction, and numeracy. Joshua A. Weller, Nathan F. Dieckmann, C. K. Mertz, and William J. Burns, Decision Research, Eugene, OR, USA. C. K. Mertz is a data analyst at Decision Research (Eugene, OR). Her research interests include multivariate statistical methods, risk Martin Tusler and Ellen Peters, Department of Psychology, The perception, and affect. Ohio State University, Columbus, OH, USA. Copyright © 2012 John Wiley & Sons, Ltd. J. Behav. Dec. Making, 26: 198–212 (2013) DOI: 10.1002/bdm http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Journal of Behavioral Decision Making Pubmed Central

Development and Testing of an Abbreviated Numeracy Scale: A Rasch Analysis Approach

Journal of Behavioral Decision Making , Volume 26 (2) – Mar 15, 2012



Publisher
Pubmed Central
Copyright
Copyright © 2012 John Wiley & Sons, Ltd.
ISSN
0894-3257
eISSN
1099-0771
DOI
10.1002/bdm.1751

Abstract

Journal of Behavioral Decision Making, J. Behav. Dec. Making, 26: 198–212 (2013). Published online 15 March 2012 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/bdm.1751

Development and Testing of an Abbreviated Numeracy Scale: A Rasch Analysis Approach

JOSHUA A. WELLER (1)*, NATHAN F. DIECKMANN (1), MARTIN TUSLER (2), C. K. MERTZ (1), WILLIAM J. BURNS (1) and ELLEN PETERS (2)
(1) Decision Research, Eugene, OR, USA
(2) Department of Psychology, The Ohio State University, Columbus, OH, USA

ABSTRACT

Research has demonstrated that individual differences in numeracy may have important consequences for decision making. In the present paper, we develop a shorter, psychometrically improved measure of numeracy (the ability to understand, manipulate, and use numerical information, including probabilities). Across two large independent samples that varied widely in age and educational level, participants completed 18 items from existing numeracy measures. In Study 1, we conducted a Rasch analysis on the item pool and created an eight-item numeracy scale that assesses a broader range of difficulty than previous scales. In Study 2, we replicated this eight-item scale in a separate Rasch analysis using data from an independent sample. We also found that the new Rasch-based numeracy scale, compared with previous measures, could predict decision-making preferences obtained in past studies, supporting its predictive validity. In Study 3, we further established the predictive validity of the Rasch-based numeracy scale. Specifically, we examined the associations between numeracy and risk judgments, compared with previous scales. Overall, we found that the Rasch-based scale was a better linear predictor of risk judgments than prior measures. Moreover, this study is the first to present the psychometric properties of several popular numeracy measures across a diverse sample of ages and educational level.
We discuss the usefulness and the advantages of the new scale, which we feel can be used in a wide range of subject populations, allowing for a more clear understanding of how numeracy is associated with decision processes. Copyright © 2012 John Wiley & Sons, Ltd.

key words: numeracy; decision making; individual differences; Rasch analysis; cognitive reflection test

*Correspondence to: Joshua Weller, Decision Research, 1201 Oak Street, Suite 200, Eugene, OR 97401, USA. E-mail: [email protected]

Decision making today involves making sense of a morass of information from various sources, such as insurance companies, financial advisors, and marketers (Hibbard, Slovic, Peters, Finucane, & Tusler, 2001; Thaler & Sunstein, 2003; Woloshin, Schwartz, & Welch, 2004). Today's consumers need an understanding of numbers and basic mathematical skills to use numerical information presented in text, tables, or charts. However, consumers differ considerably in their ability to understand and use such information (Peters, Dieckmann, Dixon, Hibbard, & Mertz, 2007). Numbers are generally provided to facilitate choices, but they can be confusing or difficult to understand and use for even the most motivated and skilled individual, and appear to be more so for those who are less skilled.

Research has demonstrated that individual differences in numeracy, the ability to comprehend and manipulate probabilistic and other numeric information, may have important consequences for decision making (Estrada, Barnes, Collins, & Byrd, 1999; Reyna, Nelson, Han, & Dieckmann, 2009). An estimate from the National Adult Literacy Survey (Educational Testing Service, 1992) suggests that approximately half of the US population has only very basic or below basic quantitative skills (Kirsch, Jungeblut, Jenkins, & Kolstad, 2002). The National Assessment of Adult Literacy (Kutner, Greenberg, Jin, & Paulsen, 2006; NCES, 2003) demonstrated similar results. In addition, these problems may be particularly acute for older adults. For example, Hibbard et al. (2001) found that a large proportion of older adults (more than half of those over age 65) had substantial difficulty using numerical information to compare Medicare health plans.

Although there are several numeracy measures available to researchers (e.g., Lipkus, Samsa, & Rimer, 2001; Peters, Dieckmann et al., 2007; Schwartz, Woloshin, Black, & Welch, 1997), the distributional characteristics of these scales previously reported suggest that the items in these measures may possess a limited range of difficulty (Cokely & Kelley, 2009; Cokely, Galesic, Schulz, Ghazal, & Garcia-Retamero, 2012). Administering a measure that does not match the range of ability level of the population of interest, which may be the case for highly numerate populations such as college students or ones that are less numerate (e.g., older adults or those with lower educational levels), potentially limits the test's ability to discriminate ability level. Put differently, the items in the measure essentially become redundant as respondents answer all items correctly in the former case and incorrectly in the latter. Therefore, a numeracy measure with a greater range of difficulty would be desirable. In the current study, we developed such a measure by adopting an item response theory (IRT) approach. Using scaling procedures developed by Rasch (1960/1993), we created a measure of numeracy derived from existing measures shown to be related to decision-making behavior.

Existing measures of numeracy

Researchers have measured numeracy in various ways often because of differences in their specific research interests and domains of study (Reyna et al., 2009). Some scales have focused on subjective perceptions of one's own numerical abilities (Fagerlin et al., 2007; Woloshin et al., 2004; Zikmund-Fisher, Smith, Ubel, & Fagerlin, 2007) in an attempt to measure numeracy without directly asking participants to make any mathematical computations. These scales, at the face level, appear to measure individual differences in confidence to effectively utilize numeric information in and ability to conduct mathematical operations. One subjective test, the Subjective Numeracy Scale (SNS, Fagerlin et al., 2007; Zikmund-Fisher et al., 2007), has been found to correlate with objective measures of numeracy. However, self-assessments of confidence are influenced by factors in addition to true ability level (Dunning, Heath, & Suls, 2004), leading to potential concerns about the validity of such assessments. Other numeracy measures have focused on objective performance, testing individuals' ability to make correct computations and understand probabilistic information. These abilities are particularly important in understanding the risk and benefit information presented in many "real-world" decision-making contexts (e.g., health and financial contexts; Burkell, 2004). Although both methods to assess individual differences in numeracy provide valuable insights, the current study focuses on the objective performance scales that have been used in the literature.

Schwartz et al. (1997) developed one of the first performance-based numeracy measures. The measure was comprised of three items that included one question assessing participants' understanding of chance (i.e., How many heads would come up in 1000 tosses of a fair coin?) and two questions asking the participants to convert a percentage to a proportion and vice versa (i.e., the chance of winning a car is 1 in 1000; what is the percentage of winning tickets for the lottery?). Lipkus et al. (2001) further expanded this scale by adding eight questions to the Schwartz et al. numeracy scale; the additional items were designed to assess a participant's ability to understand and compare risks (e.g., Which of the following numbers represents the biggest risk of getting a disease: 1%, 10%, or 5%?) and to accurately work with decimal representations, proportions, and fractions. Moreover, Peters, Hibbard, Slovic, and Dieckmann (2007) further expanded the Lipkus et al. numeracy scale, introducing four additional items in an attempt to expand the range of difficulty; these additional items assess the understanding of base rates as well as the ability to make more complex likelihood calculations.

Similarly, Frederick (2005) developed a three-item measure, the cognitive reflection test (CRT), which includes items that involve mathematical ability. Although the CRT was not explicitly defined as a numeracy test and only speculation exists about the underlying dimensions of the CRT, the items appear to require understanding, manipulating, and using numbers to solve them. Prior research has supported this assertion. For instance, Obrecht, Chapman, and Gelman (2009) found that the CRT was moderately correlated with SAT quantitative scores (r = .45; see Toplak, West, & Stanovich, 2011 for similar findings). In a smaller study, Cokely and Kelley (2009) reported a significant (r = .31) correlation between numeracy and CRT performance. Moreover, Liberali, Reyna, Furlan, Stein, and Pardo (2011) reported a moderate to strong correlation (r = .51 and r = .40 in Brazilian and US samples, respectively; Cohen, 1992) between the 11-item Lipkus et al. (2001) scale and the CRT. Finally, in a large sample including individuals from across the adult lifespan, Finucane and Gullion (2010) also reported a similar effect size (r = .53) between the CRT and numeracy. These findings give us an a priori basis to test whether the CRT items may also serve as valid indicators of the latent construct of numeracy.

Although both the Schwartz-based numeracy scales and the CRT are predicted to be indicators of numeracy, evidence suggests that these scales may differ in their ability to assess performance at different levels of the latent trait. For instance, even in very numerate populations, such as college students from highly selective universities, a substantial proportion of participants score only 0 or 1 on the three-item CRT. Frederick (2005) reported that approximately one-third of this total sample scored 0 on the CRT and another 28% answered only one question correctly. Further, the modal score of nearly half of the sub-samples collected was 0. In contrast, median scores on the Lipkus et al. (2001) measure approach the maximum range of scores (e.g., Peters, Västfjäll, Slovic, Mertz, Mazzocco, & Dickert, 2006). The skewness of each of these measures may limit the measure's ability to discriminate numeracy level in many populations and may provide a disadvantage when assessing any linear effects of numeracy.

Associations between individual differences in numeracy and decision making

Individual differences in numeracy have been shown to have important associations with judgment and decision making. Recent reviews of the numeracy literature have found that compared with highly numerate individuals, those lower in numeracy are more likely to have difficulty judging risks and providing consistent assessments of utility, are worse at reading graphs, show larger framing effects, and are more sensitive to the formatting of probability information (for reviews, see Peters, Hibbard et al., 2007; Reyna et al., 2009). Although numeracy typically leads to better decision making, there is evidence that the increased numerical processing observed in the highly numerate can lead to increased affective reactions to numbers, or number comparisons, which, in turn, can result in optimal or sub-optimal decision making. In an optimal example, Peters et al. (2006) asked participants to complete a ratio bias task. They were offered a chance to win a prize by drawing a red jellybean from a bowl. When provided with two bowls from which to choose, participants often elected to draw from a large bowl containing a greater absolute number, but smaller proportion, of red beans (9 in 100, 9%) rather than from a small bowl with fewer red beans but a better winning probability (1 in 10, 10%) even with the probabilities stated beneath each bowl. Peters et al. (2006) found that 33% and 5% of less and more numerate adults, respectively, chose the larger inferior bowl. Controlling for SAT scores, the choice effect remained significant. In addition, compared with the highly numerate, the less numerate reported less affective precision about Bowl A's 9% chance ("How clear a feeling do you have about [its] goodness or badness?"); their affect to the inferior 9% odds ("How good or bad does [it] make you feel?") was directionally less negative. Peters et al. (2006) concluded that affect derived from numbers and number comparisons may underlie the highly numerate's greater number use (cf. the "Bets" experiment in the present paper's Study 2 and Peters et al., 2006).

Frederick (2005) also found that individuals who performed well on his CRT were more likely to choose a future reward of greater value than a smaller immediate reward. Further, these individuals demonstrated evidence of weaker reflection effects (i.e., risk taking to avoid losses is greater than risk taking to achieve gains; Kahneman & Tversky, 1979), compared with individuals scoring low in cognitive reflection. High-CRT individuals also were less likely to show risk-averse preferences towards gambles when the relative expected value between choice options favored choosing an uncertain option. Moreover, Toplak et al. (2011) found that greater CRT performance was significantly associated with an index of rational decision making comprised of a collection of classic heuristics and biases tasks.

Development of an abbreviated numeracy scale

A common problem with traditional methods of short-form scale construction has been the reliance on item–total correlations to guide item selection for short forms (i.e., choosing items with the highest item–total correlations). Using such an approach renders the researcher unable to ascertain whether the short form has removed error variance or narrowed the construct (Smith, McCarthy, & Anderson, 2000). In turn, scales developed in this manner are often less able to fully assess the scope of the construct in question, thus posing a threat to predictive validity of the measure despite retaining levels of internal consistency similar to the long form (Smith & McCarthy, 1996).

Alternative scaling methods can allay such concerns.

standing on a latent trait or ability level and the difficulty of the test item. According to this model, the probability that an individual will correctly answer an item is a logistic function of the difference between the individual's trait level and the extent to which the trait is expressed in the item. Put differently, the higher a person's ability relative to the difficulty of an item, the higher the probability of a correct response on that item. When a person's location on the latent trait is equal to the difficulty of the item, there is, by definition, a .5 probability of a correct response in the Rasch model. Thus, for each item, Rasch analyses can characterize a curve that describes the ability level at which the item maximally discriminates.

Overview of the present paper

In Study 1, we focused on the development of a Rasch-based numeracy measure. For our item pool, we used items from the existing scales: the Schwartz et al. (1997) three-item measure, the Lipkus et al. (2001) expanded 11-item numeracy scale, further expansion of that scale by Peters, Hibbard et al. (2007), and Frederick's (2005) CRT. In contrast to a typical short-form scale construction that attempts to reduce a single existing scale, our primary objective was to retain the range of difficulty shown across the scales and to develop a shorter numeracy measure (relative to the entire item pool and to individual measures as possible). The former point will allow a broader use of the scale for populations who show limited variability on the existing measures. To achieve these goals, we incorporated items from all four measures that encompass a greater range of difficulty than any one of the scales. In Study 2, we confirmed the Rasch analysis results on an independent sample and tested the predictive validity of the scale by replicating findings that have been obtained in previous studies. Additionally, we compared the predictive validity of our scale with that of the CRT and the Lipkus et al. measure. Finally, in Study 3, we further tested the predictive and comparative validity of the Rasch-based numeracy scale by examining its associations with risk likelihood judgments.
Using these techniques, which can be classified as IRT-based scaling, one can develop more efficient psychological tests, in the sense that fewer items are needed to measure a latent STUDY 1 construct while concurrently maintaining the scale’s range of difficulty. Importantly, these methods largely preserve Method psychometric indices such as mean inter-item correlations Participants despite reductions in the number of items, upon which Participants were 1970 subjects collected from three sepa- calculations of coefficient a are based. rate samples. The first sample consisted of 302 community One IRT-based scaling approach was developed by members, equally divided between those with higher edu- Rasch (1960/1993) and has been successfully used to cation and those with lower education. Participants were develop shorter instruments for a wide range of constructs recruited through online and newspaper advertisements. (e.g., Cole, Kaufman, Smith, & Rabin, 2004; Hibbard, The second sample consisted of 163 undergraduates in an Mahoney, Stockard, & Tusler, 2005; Prieto, Alonso, & introductory psychology class. Finally, the third sample Lamarca, 2003; Simon, Ludman, Bauer, Unützer, & was an online study of adults using the American Life Operskalski, 2006). In a Rasch model, responses are viewed Panel (n= 1505). These three samples were merged into a as outcomes of the interaction between a test taker’s single dataset. The sample included 894 women (45.3%) and 1076 men (54.7%). The median age for this sample was 48 years For further reading regarding IRT-based approaches versus a classical test theory approach, see Lord and Novick (1968) and Embretson (1996). (range = 18–89). Highest educational level attained was as Copyright © 2012 John Wiley & Sons, Ltd. J. Behav. Dec. Making, 26: 198–212 (2013) DOI: 10.1002/bdm J. A. Weller et al. Rasch-Based Numeracy Scale 201 follows: 3% of participants did not graduate from high school, Path parameters were freely estimated. 
Both the one-factor 16.3% received a high school diploma, 9.2% attended a and two-factor solutions showed nearly identical fit statistics vocational/trade school or community college, 31.7% had (see Table 1 for fit statistics and factor loadings). Given that completed some college (including those currently enrolled in the two-factor model does not offer an appreciably better a 4-year program), 21.5% received a bachelor’s degree, and model fit and the between-factor correlation was high 17.5% had an advanced degree. The college sample received (r = .85), the more parsimonious explanation of the data course credit for their participation, and individuals in both com- favors adopting a one-factor model. The data suggest that munity samples were financially compensated for participation. the assumption that the item pool represented a coherent, unitary construct is a tenable one; hence, Rasch-based scal- ing is appropriate. Numeracy scales All participants completed the following measures of numeracy: the 11-item Lipkus numeracy scale (Lipkus et al., 2001), which also included the three items from Schwartz et al. (1997), four Rasch analysis additional items developed by Peters, Hibbard et al. (2007), Table 2 shows the item difficulty statistics for all items (i.e., the and three CRT items (Frederick, 2005). proportion of participants correctly answering each item). On average, the Lipkus et al. numeracy items were less difficult, whereas the CRT items were more difficult. Next, we Results and discussion conducted a Rasch analysis on all numeracy and CRT items, Numeracy scales following the procedure of Hibbard et al. (2005). Initially, For the Schwartz et al. three-item scale, Cronbach’s a =.58, mean items were assessed for fit. In general, fit statistics should range inter-item r = .31. Adding the additional eight items of Lipkus from .5 to 1.5 (Linacre, 2002). One item was deleted because et al. to the Schwartz et al. 
scale resulted in the 11-item Lipkus of a poor outfit statistic. All other items met this criterion. To numeracy measure with Cronbach’s a = .76, mean inter-item reduce the item pool further, items were deleted sequentially r= .23. When adding the four additional items of Peters et al. on the basis of the extent to which the deletion minimally re- to the Lipkus measure, Cronbach’s a = .76, mean inter-item duced the person reliability. Person reliability is a measure of r = .19. For the CRT, Cronbach’s a=.60, mean inter-item r=.34. the ability of the scale to discriminate the sample into different In the current sample, the Peters, Hibbard et al. (2007) and levels of ability and, therefore, is a key construct in measure CRT measures were significantly correlated (r = .49). Further, development using the Rasch technique. After each item was examination of Cronbach’s a of the omnibus 18-item scale deleted, Rasch analysis was rerun to determine the decrease (a = .75) and the mean inter-item correlation (r = .19) for the in person reliability for that deletion. The item that decreased combined items provides initial evidence that the decision to person reliability the least was deleted, and the process was combine these scales was warranted. repeated. In the case of ties, items that were most similar to remaining items in difficultyweredeleted.The processwas stopped when further deletions resulted in unacceptably low Confirmatory factor analysis levels of person reliability (Hibbard et al., 2005). Because Rasch analysis assumes that the latent construct is The final scale consisted of eight items, five from the unitary in nature, the most important threat to this assumption original Lipkus et al. scale (including the three original would occur if the CRT and the items from the other numeracy Schwartz et al. items), two from the CRT scale, and one of scales represented separate factors. Such a finding would the Peters et al. items. 
Difficulty structure and fit statistics are suggest that the item pool that we intended to use would not shown in Table 3. Fit statistics for all items were deemed to tenably represent a coherent, unitary construct. To test whether be adequate, and person reliability was .63. Cronbach’s a for the CRT and numeracy items load on a unitary factor, we com- the eight-item scale was .71 and mean inter-item was r = .24. pared two separate confirmatory factor analysis (CFA) models: Consistent with the psychological assessment literature, which (i) a single-factor model in which all numeracy and CRT items suggests that the mean inter-item correlation is a more useful loaded on a unitary factor and (ii) a correlated two-factor model index of internal consistency, the observed mean inter-item with CRT items loading on one dimension and numeracy items correlation was acceptable for measuring a broad, higher-order loading on another factor. CFA is widely regarded in the broader psychological assessment literature to be the strongest test for unidimensionality, compared with exploratory factor analysis methods. CFAs were conducted using MPLUS version As suggested by Cortina (1993), we calculated the index of a precision es- 6.1 software. A variance-adjusted weighted least squares esti- timate that estimates the “spread” or standard error of a. Although larger mation was used to estimate dichotomous variables in CFA. values of this estimate cannot definitively state that multidimensionality is present, higher standard errors are a symptom of multidimensionality. Con- versely, an estimate = 0 would suggest unidimensionality. For the reduced From the inter-item correlation matrix, we chose to omit two items from eight-item scale, the precision estimate = .01. For comparison purposes, we these analyses. 
We chose to omit question 8a because of its strong redun- created a hypothetical scale with the same number of items that included dancy with item 8b (r = .78), compared with its correlation with other items. two orthogonal dimensions, maintaining a roughly equivalent a and mean in- We also conducted a CFA with item 8a instead of 8b, and these findings did ter-item correlation to that of our scale (a = .72 and r = .246). For the hypo- not appreciably differ from those reported. Further, we chose to omit ques- thetical scale, the precision estimate = .06. These findings would suggest tion 14 (SARS item) because it showed no significant associations with other that the spread of the inter-item correlations more closely resembles a unitary items in the item pool at p< .05. scale rather than a multidimensional scale. Copyright © 2012 John Wiley & Sons, Ltd. J. Behav. Dec. Making, 26: 198–212 (2013) DOI: 10.1002/bdm 202 Journal of Behavioral Decision Making Table 1. Fit statistics and unstandardized and standardized coefficients for one-factor and two-factor confirmatory factor analysis solutions— Study 1 Two-factor solution One-factor solution Factor 1 Factor 2 Item number Ustd (SE) Std (SE) Ustd (SE) Std (SE) Ustd (SE) Std (SE) Q1. Imagine that we roll a fair, six-sided die 1000 times. 1.0 (.00) 1.0 (.00) .67 (.02) .64 (.02) Out of 1000 rolls, how many times do you think the die would come up as an even number? Q2. In the BIG BUCKS LOTTERY, the chances of 1.10 (.05) .70 (.02) 1.1 (.05) .70 (.02) winning a $10.00 prize are 1%. What is your best guess about how many people would win a $10.00 prize if 1000 people each buy a single ticket from BIG BUCKS? Q3. In the ACME PUBLISHING SWEEPSTAKES, the 1.17 (.05) .76 (.02) 1.18 (.05) .77 (.02) chance of winning a car is 1 in 1000. What percent of tickets of ACME PUBLISHING SWEEPSTAKES win a car? Q4. Which of the following numbers represents the biggest 1.13 (.06) .73 (.03) 1.12 (.06) .73 (.03) risk of getting a disease? 
(1 in 100, 1 in 1000, or 1 in 10) Q5. Which of the following numbers represents the biggest 1.07 (.06) .69 (.03) 1.07 (.06) .69 (.03) risk of getting a disease? (1%, 10%, or 5%) Q6. If Person A’s risk of getting a disease is 1% in 10 years, 1.16 (.05) .75 (.02) 1.17 (.05) .76 (.02) and Person B’s risk is double that of A’s, what is B’s risk? Q7. If Person A’s chance of getting a disease is 1 in 100 1.11 (.05) .72 (.02) 1.12 (.05) .72 (.02) in 10 years, and person B’s risk is double that of A, what is B’s risk? Q8b. Out of 1000? .92 (.06) .60 (.03) .92 (.06) .60 (.03) Q9. If the chance of getting a disease is 20 out of 100, this 1.03 (.05) .67 (.03) 1.03 (.05) .67 (.03) would be the same as having a _____% chance of getting the disease. Q10. The chance of getting a viral infection is .0005. Out .77 (.05) .49 (.03) .77 (.05) .50 (.03) of 10 000 people, about how many of them are expected to get infected? Q11. Which of the following numbers represents the 1.14 (.07) .74 (.04) 1.14 (.07) .74 (.04) biggest risk of getting a disease? (1 in 12 or 1 in 37) Q12. Suppose you have a close friend who has a lump in .74 (.07) .48 (.04) .74 (.07) .48 (.04) her breast and must have a mammography .. . The table below summarizes all of this information. Imagine that your friend tests positive (as if she had a tumor), what is the likelihood that she actually has a tumor? Q13. Imagine that you are taking a class and your chances 1.05 (.05) .67 (.02) 1.05 (.05) .68 (.02) of being asked a question in class are 1% during the first week of class and double each week thereafter (i.e., you would have a 2% chance in Week 2, a 4% chance in Week 3, an 8% chance in Week 4). What is the probability that you will be asked a question in class during Week 7? Q15 (CRT). A bat and a ball cost $1.10 in total. The bat costs 1.20 (.05) .77 (.02) 1.16 (.05) .85 (.02) $1.00 more than the ball. How much does the ball cost? Q16 (CRT). 
If it takes five machines 5 minutes to make 1.06 (.05) .68 (.02) 1.0 (.00) .74 (.03) five widgets, how long would it take 100 machines to make 100 widgets? Q17 (CRT). In a lake, there is a patch of lily pads. Every .91 (.05) .58 (.03) .87 (.05) .64 (.03) day, the patch doubles in size. If it takes 48 days for the patch to cover the entire lake, how long would it take for the patch to cover half of the lake? Fit statistics X /df 9.980 9.628 CFI .912 .917 TLI .900 .903 RMSEA .068 .066 Note. Standard errors are reported in parentheses. CFI, comparative fit index; RMSEA, root mean square error of approximation; SE, standard error; TLI, Tucker–Lewis index. construct (Briggs & Cheek, 1986; Clark & Watson, 1995). Descriptive statistics. Figure 1 shows frequency distribu- Combined with the CFA results, these results suggest that the tions for the separate measures used: the Lipkus et al. Rasch-based numeracy scale measures the construct in a coher- measure (Panel A), Frederick’s CRT (Panel B), the Peters ent, unitary, and internally consistent manner. et al. measure (Panel C), and the Rasch-modeled scale Copyright © 2012 John Wiley & Sons, Ltd. J. Behav. Dec. Making, 26: 198–212 (2013) DOI: 10.1002/bdm J. A. Weller et al. Rasch-Based Numeracy Scale 203 Table 2. Item difficulties for individual items—Study 1 Item Item difficulty Q11. Which of the following numbers represents the biggest risk of getting a disease? (1 in 12 or 1 in 37) 96.1 Q5. Which of the following numbers represents the biggest risk of getting a disease? (1%, 10%, or 5%) 94.5 Q4. Which of the following numbers represents the biggest risk of getting a disease? (1 in 100, 1 in 1000, or 1 in 10) 92.7 Q8a. If the chance of getting a disease is 10%, how many people would be expected to get the disease? Out of 100? 91.2 Q8b. Out of 1000? 88.1 Q9. If the chance of getting a disease is 20 out of 100, this would be the same as having a _____% chance of getting the 84.3 disease. Q1. 
Imagine that we roll a fair, six-sided die 1000 times. Out of 1000 rolls, how many times do you think the die would come 74.9 up as an even number? Q13. Imagine that you are taking a class and your chances of being asked a question in class are 1% during the first week of 74.3 class and double each week thereafter (i.e., you would have a 2% chance in Week 2, a 4% chance in Week 3, an 8% chance in Week 4). What is the probability that you will be asked a question in class during Week 7? Q6. If Person A’s risk of getting a disease is 1% in 10 years, and Person B’s risk is double that of A’s, what is B’s risk? 71.2 Q2. In the BIG BUCKS LOTTERY, the chances of winning a $10.00 prize are 1%. What is your best guess about how many 70.6 people would win a $10.00 prize if 1000 people each buy a single ticket from BIG BUCKS? Q10. The chance of getting a viral infection is .0005. Out of 10 000 people, about how many of them are expected to get 58.4 infected? Q7. If Person A’s chance of getting a disease is 1 in 100 in 10 years, and person B’s risk is double that of A, what is B’s risk? 55.3 Q14. Suppose that 1 out of every 10 000 doctors in a certain region is infected with the SARS virus; in the same region, 20 out 52.8 of every 100 people in a particular at-risk population also are infected with the virus. A test for the virus gives a positive result in 99% of those who are infected and in 1% of those who are not infected. A randomly selected doctor and a randomly selected person in the at-risk population in this region both test positive for the disease. Who is more likely to actually have the disease? Q3. In the ACME PUBLISHING SWEEPSTAKES, the chance of winning a car is 1 in 1000. What percent of tickets of 34.5 ACME PUBLISHING SWEEPSTAKES win a car? Q16 (CRT). If it takes five machines 5 minutes to make five widgets, how long would it take 100 machines to make 100 32.3 widgets? Q17 (CRT). In a lake, there is a patch of lily pads. Every day, the patch doubles in size. 
If it takes 48 days for the patch to 31.9 cover the entire lake, how long would it take for the patch to cover half of the lake? Q15 (CRT). A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost? 18.7 Q12. Suppose you have a close friend who has a lump in her breast and must have a mammography .. . The table below 9.8 summarizes all of this information. Imagine that your friend tests positive (as if she had a tumor), what is the likelihood that she actually has a tumor? Table 3. Difficulty structure and fit statistics for the eight-item Associations between Rasch-based numeracy scale and numeracy scale—Study 1 demographic variables. Somewhat surprisingly, we found no significant negative correlation between age and numer- Item Difficulty Infit Outfit acy (r=.02, ns). With respect to gender, we found that Q12 89.0 1.10 .90 men performed better than women (point biserial r = .28, CRT1 73.5 .95 .72 p< .001). We also investigated how educational level was CRT3 60.2 .87 .75 associated with numeracy performance. As shown in Table 5, Q3 57.9 .84 .76 Q2 39.6 1.24 1.61 we observed that a disproportionate number of individuals Q1 29.8 .90 .77 with a high school/trade school or less educational level Q9 26.2 1.02 1.16 (low education group) scored 0 on the CRT (64%). In fact, Q8b 15.2 1.05 .79 even among those with a bachelor’s degree or greater (high-education group), the modal response was still 0. In (Panel D). Table 4 shows the descriptive statistics for contrast, we observed that the Lipkus et al. measure showed each scale. As expected, the CRT was positively skewed, a greater negative skew as a function of participants’ educa- whereas the Peters et al. measure and especially the tional level. Nearly 69% of all individuals scored 9 or higher Lipkus et al. measure were negatively skewed. These on the Lipkus et al. measure. The Rasch-based measure, in findings suggest that both the Lipkus et al. 
measure and comparison, maintained a relatively normal distribution the CRT do not adhere to a normal distribution. On the across different educational levels. For this scale, the major- contrary, performance scores for the Rasch-based numer- ity of respondents scored in the middle of the distribution, acy scale were roughly normally distributed (M = 4.12, with predictably more individuals in the lower- SD = 1.87, median = 4, mode = 4), and the distribution education group scoring worse on the scale, whereas in the was not significantly skewed (.07, z = 0.11, ns). Taken higher-education group, more individuals scored towards together, these results strongly suggest that the CRT, the higher end of the distribution. To further examine these Lipkus et al., and Peters et al. scales, taken separately, may educational level differences with the Rasch-based numeracy be too difficult or too easy, which may limit the sensitivity of measure, we conducted a one-way analysis of variance for the test to accurately detect an individual’s true ability level educational level (three levels: high school/trade school on the latent construct. education or less, some college, and 4-year college graduate Copyright © 2012 John Wiley & Sons, Ltd. J. Behav. Dec. Making, 26: 198–212 (2013) DOI: 10.1002/bdm 204 Journal of Behavioral Decision Making Figure 1. Frequency distributions of individual scales—Study 1. We present the frequency distributions of the cognitive reflection task (CRT; Panel a), the Lipkus et al. numeracy measure (Lipkus; Panel b), the Peters et al. numeracy measure (Peters; Panel c), and the new reduced Rasch-derived model developed in the current study (“Rasch”; Panel d). Table 4. Descriptive statistics for numeracy measures—Study 1 Table 5. 
Distribution of correct answers for the CRT, Schwartz et al., Lipkus et al., and Rasch-based measures as a function of Scale Mean (SD) Median Mode Skewness educational level—Study 1 CRT (three items) 0.83 (.99) 0 0 .88 Educational level Schwartz et al. 1.8 (1.01) 2 2 –.36 Scale (three items) score High school/trade Some college College grad Lipkus et al. (11 items) 8.15 (2.36) 9 10 –.94 Cognitive reflection test Peters et al. (15 items) 10.48 (2.81) 11 12 –.98 0 64.1 55.8 36.1 Rasch-based 4.13 (1.87) 4 4 .00 1 22.3 24.8 25.5 (eight items) 2 9.5 13.7 23.1 3 4.2 5.8 15.3 Schwartz et al. 0 22.6 12.4 4.1 1 32.3 27.7 17.1 or greater). As expected, we found a significant main effect 2 29.7 36.5 35.9 for educational level (F(2, 1965) = 169.20, p< .001). Those 3 15.4 23.5 43.0 holding a college degree or greater performed best on the Lipkus et al. Rasch-based numeracy measure (M= 4.90, compared with 0–4 16.9 7.7 2.8 4.02 and 3.06 for the some college and high school/trade 5–8 53.0 47.4 28.4 9–11 37.7 44.9 68.8 school or less education groups, respectively). Peters et al. 0–4 6.5 3.1 0.8 5–8 35.9 21.2 7.7 9–12 46.0 57.3 51.0 Convergent validity 13–15 11.6 18.3 40.4 Participants from the community sub-sample also completed the Rasch-based 0–2 37.4 20.8 8.5 Fagerlin et al. (2007) eight-item SNS (a = .86). As expected, we 3–5 50.8 61.2 53.1 found that the Rasch-based numeracy measure was significantly 6–8 11.8 17.9 38.4 correlated with individuals’ subjective perceptions of numeracy Copyright © 2012 John Wiley & Sons, Ltd. J. Behav. Dec. Making, 26: 198–212 (2013) DOI: 10.1002/bdm J. A. Weller et al. Rasch-Based Numeracy Scale 205 (r=.55, p< .001). This correlation did not differ from the education, 26% had completed high school or a trade school, Lipkus et al. measure (r = .55) or the Peters et al. 15-item mea- 57% had completed some college or had a college degree, sure (r = .57). It was stronger than both the Schwartz et al. 
and 14% had completed schooling beyond a 4-year degree. three-item measure (r = .44) and the CRT (r = .43). The Decision Research Web panel participants are compen- Taken together, these results indicate that the Rasch-based sated $15 per hour (prorated). measure was able to reduce the item pool from 18 to eight items, while maintaining the psychometric qualities of the larger item pool and the composite scales. Additionally, we Decision-making tasks found evidence of convergent validity and largely replicated Ratio bias task. As explained earlier, in the ratio bias task previously reported correlations with demographic variables. (Denes-Raj & Epstein, 1994), participants are offered a chance to win a prize by drawing a red jellybean from one of two bowls. One bowl has a greater absolute number of STUDY 2 red beans (i.e., 9 in 100), and the other bowl has a smaller absolute number but a greater proportion of red beans (i.e., Overview 1 in 10). Peters et al. (2006) predicted and found that less The purpose of Study 2 was both to confirm the Rasch results numerate adults drew more often from the affectively appeal- from Study 1 on an independent sample and to test the ing bowl with less favorable objective probabilities whereas predictive validity of the eight-item Rasch-based numeracy the highly numerate drew more often from the objectively scale. We tested performance on three decision-making para- better bowl. Participants responded on a 13-point bipolar digms that previously have been associated with individual scale (1 = strongly prefer 9% bowl,7= no preference, differences in numeracy (Peters et al., 2006). Specifically, 13 = strongly prefer 10% bowl). We predicted that the new we tested whether performance on the Rasch-based numer- measure would also replicate the findings of Peters et al. acy measure predicted the following: (i) the extent of framing effects; (ii) how individuals rated the attractiveness of bets in “Bets” task. Peters et al. 
(2006) concluded from the ratio bias a “less is more” effect paradigm (Slovic, Finucane, Peters, & task discussed earlier that an affective process may underlie MacGregor, 2002); and (iii) the extent of denominator the greater number use of numbers by the highly numerate. neglect in a ratio bias task. We also compared the predictive If correct, then highly numerate individuals (who are thought validity of the Rasch-based scale with that of two of the to be more likely to draw affective meaning from number component measures, namely the CRT and the Lipkus et al. comparisons) may sometimes overuse numbers and respond measures. One well-established criteria of successful short- less rationally than the less numerate. As a replication of form development is that an abbreviated measure should the work of Peters et al. (2006), the bets task was conducted not result in significant decrements to validity (Smith et al., in a between-subjects design. One group of participants rated 2000). By definition, short-form development attempts to the attractiveness of a no-loss gamble (7/36 chances to win reduce a construct that prior researchers concluded required $9; otherwise, win $0) on a 0–20 scale; a second group rated a more lengthy assessment. If a full-length scale contains a similar gamble with a small loss (7/36 chances to win $9; much irrelevant or invalid content, then one could expect that otherwise lose 5¢). Peters et al. (2006) hypothesized and the validity of the short-form scale would increase. However, found that highly numerate participants rated the objectively if the items contained in full-length assessment are largely worse bet as more attractive and reported more precise affect valid, then one would expect that a short-form measure and more positive affect to the $9 in the loss’ presence. Thus, would result in reduced validity. 
although greater numeracy is generally thought to lead to better decisions when numeric information is involved, it sometimes appears to be associated with an overuse of number comparisons, which may subsequently lead to sub-optimal judgments despite higher ability levels. These results were consistent with the highly numerate accessing a richer affective “gist” from numbers (Reyna et al., 2009). Thus, we predicted a significant bet condition × numeracy interaction.

In this sense, the Rasch-based measure does not necessarily have to demonstrate increased validity compared with other assessments but should, at least, show comparable validity with that observed with the other measures.

Method

Participants

The sample consisted of 899 participants who consented to be part of an ongoing opt-in Web panel administered by Decision Research. The panel members are 65% women and have a mean age of 38.7 years. Two percent had less than a high school education, 33% had completed high school or a trade school, 53% had completed some college or had a college degree, and 13% had completed schooling beyond a 4-year degree. A subset of this sample (n = 723, 70% women) was used for testing the predictive validity of the Rasch-based measure. The mean age of the sample was 39.5 years. One percent had less than a high school

Framing. Participants were presented with the exam scores and course levels (200, 300, or 400, indicating varying difficulty levels of classes) of three psychology students and were asked to rate the quality of each student’s work on a 7-point scale (−3 = very poor to +3 = very good). Framing was manipulated between subjects as percent correct or percent incorrect so that “Paul,” for example, was described as receiving either 74% correct on his exam or 26% incorrect. Consistent with prior research (Peters et al., 2006), we predicted the difference in ratings for the positive versus negative frames would be greatest among less numerate participants. Put differently, participants lower in numeracy were expected to show more pronounced framing effects than those higher in numeracy. More numerate individuals were expected to transform the provided frame into the normatively equivalent alternative frame so that they would have both frames of information available (Cokely & Kelley, 2009; Peters et al., 2006).

All participants completed the framing and bets decision tasks. A subset (n = 218) also completed the ratio bias task.

Results and discussion

Rasch analysis

Rasch analysis was conducted in the same manner as in Study 1. We found that the results matched those obtained in Study 1, both in terms of the items retained as well as their relative difficulties (Table 6). Person reliability for this scale was .65, and Cronbach’s α was .71; the mean inter-item correlation for the retained items was .24.

Table 6. Difficulty structure and fit statistics for the Rasch-based numeracy scale: Study 2

Item    Difficulty   Infit   Outfit
Q12     90.5         1.24    .74
CRT1    76.6         .96     .94
CRT3    60.5         .91     .67
Q3      54.2         .84     .80
Q2      30.5         .96     .84
Q1      29.7         1.07    1.27
Q9      17.4         .91     .62
Q8b     14.2         1.07    .98

Note. Higher difficulty scores indicate greater difficulty.

Descriptive statistics

As predicted, the scores for the Rasch-based numeracy scale were roughly normally distributed (M = 4.07, SD = 1.83, median = 4, mode = 4), and the distribution was not significantly skewed (skewness = .07, z = 0.83).

Associations between Rasch-based numeracy and demographic variables. We found the expected negative correlation between age and numeracy (r = −.17, p < .001). Additionally, we found that men performed better than women (point biserial r = .31, p < .001). Moreover, we conducted a one-way analysis of variance to determine differences in numeracy as a function of educational level (three levels: high school/trade school education or less, some college, and 4-year college graduate or greater). As expected, we found a significant main effect for educational level (F(2, 720) = 35.57, p < .001), in that those with a 4-year college degree or greater performed better on the Rasch-based numeracy measure (M = 4.60, compared with 4.11 and 3.27 for the some college and high school/trade school or less education groups, respectively). Overall, these findings replicate the results reported in Study 1.

Predictive validity

We report the analyses based on the Rasch-derived measure in the following sections and then discuss the issue of comparative validity.

Ratio bias task. We also replicated the findings from the ratio bias task (Peters et al., 2006). Consistent with Peters et al. (2006), more numerate participants had a stronger preference for the objectively better bowl (10% bowl) than those lower in numeracy (r(218) = 0.16, p < .01). This result also is consistent with Stanovich and West’s (2008) finding that cognitive ability was significantly associated with a similar ratio bias problem.

Bets task. We regressed the rated attractiveness of the gamble on condition (coded −1 = no loss, 1 = small loss), the individual differences in numeracy (mean deviated), and the interaction between numeracy level and condition. Consistent with prior research (Bateman, Dent, Peters, Slovic, & Starmer, 2007; Slovic et al., 2002; Stanovich & West, 2008), participants rated the gamble as more attractive in the small loss condition (F(1, 719) = 60.40, p < .001, b = .92). Participants higher in numeracy also rated the gamble to be more attractive overall than those lower in numeracy (F(1, 719) = 16.11, p < .001, b = .26). Replicating Peters et al. (2006), the hypothesized interaction was also significant, such that participants higher in numeracy were more strongly affected by the small loss in the task (F(1, 719) = 6.50, p < .01, b = .17).

Framing task. We regressed the average rated quality of the students’ work on frame condition (coded −1 = negative, 1 = positive), numeracy (mean deviated), and a frame × numeracy interaction. Subjects who did not respond to all stimuli (n = 29) were excluded from the analyses. As expected, we replicated the findings from the framing task reported earlier (Peters et al., 2006). We found a significant effect for frame (F(1, 690) = 245.07, p < .001, b = .48) and additionally found a significant main effect for numeracy (F(1, 690) = 4.59, p < .05, b = .04). Most importantly, we found a significant frame × numeracy interaction (F(1, 690) = 8.34, p < .001, b = .05), in which less numerate participants showed larger framing effects. These findings replicate the work of Peters et al. (2006) and, moreover, are consistent with research suggesting that less numerate decision makers focus on non-numeric sources of information when constructing preferences (Dieckmann, Slovic, & Peters, 2009; Peters, Dieckmann, Västfjäll, Mertz, Slovic, & Hibbard, 2009).

Comparative validity

Table 7 shows the results for the three behavioral tasks as a function of the numeracy assessment used. Overall, the Rasch-based scale demonstrates comparable validity with that observed with the Lipkus et al. and CRT scales. For the ratio bias task, the Rasch-based measure was more strongly associated with preference for the normatively correct bowl than the CRT; associations of the Rasch-based scale and the longer Lipkus et al. scale were about the same. For the bets task, we found that the numeracy × bet condition interaction was significant using all three numeracy measures. To test the extent to which this effect was stronger for the Rasch-based numeracy measure, we calculated and compared the effect size estimates for the differences between bet conditions (i.e., bets effect) as a function of both numeracy level (i.e., either high or low numeracy) and specific numeracy measure. Essentially, these analyses compare the simple effects of the interaction in terms of a linear contrast for numeracy, as construed by the different measures. For the Rasch-based measure, the effect size of the bets effect for those scoring highest in numeracy (seven to eight items correct; d = 1.06) was nearly four times as large as the effect size observed for those scoring lowest on the Rasch-based numeracy measure (zero to two items correct; d = .27). Similarly, those who scored 0 on the CRT showed weaker effect sizes (d = .42) than those who answered all three CRT items correctly (d = .70). We also observed a stronger effect size for those scoring highest on the Lipkus et al. measure (9–11 items correct, d = .65) than for individuals scoring the lowest on numeracy (zero to four items correct, d = .06). Thus, although we found the significant predicted interaction effect for all three scales, these results suggest that these effects were strongest when assessed with the Rasch-based numeracy scale.

Table 7. Comparative validity analyses regressing decision performance on numeracy scales: Study 2

                Ratio bias  Bets task                               Framing task
Scale           Pearson r   Condition  Numeracy  Interaction  R²    Condition  Numeracy  Interaction  R²
CRT             .11         0.91**     0.57**    0.24*        .11   0.47**     0.03      0.1**        .27
Lipkus et al.   .14         0.92**     0.15      0.11*        .09   0.47**     0.03*     0.02         .26
Rasch-based     .16         0.92**     0.26**    0.16**       .10   0.48**     0.04*     0.05**       .27

Note. CRT, cognitive reflection test. *p < .05, **p < .01. Each row reflects a separate regression analysis. Unstandardized coefficients and effect sizes are shown for each independent variable.

For the framing task, we observed interaction effects for both the CRT and Rasch-based measures, but not for the Lipkus et al. measure. To explore these interaction effects in greater depth, we again calculated and compared effect size estimates of framing effects for high and low scorers on the CRT and Rasch-based measures. Individuals who scored lowest on the Rasch-based measure showed very strong framing effects (d = 1.42), even more so than those scoring 0 on the CRT (d = 1.33). In contrast, we found that individuals scoring highest on the CRT showed about the same framing effects (d = .67) as did those scoring the highest on the Rasch-based numeracy scale (d = .65). Thus, these results provide evidence that the Rasch-based measure showed a slight advantage over the CRT when predicting framing effects for the less numerate, which was in the predicted direction of the interaction.

Together, these results provide evidence that the Rasch-based numeracy scale shows comparable validity with both the Lipkus et al. measure and the CRT. The Rasch-based measure showed better distributional qualities than the CRT or the Lipkus et al. measure and also demonstrated some evidence for stronger predictive validity than these existing measures. However, we acknowledge that this evidence is somewhat mixed. Compared with the Lipkus et al. measure, we found the Rasch-based measure to show stronger simple effects when we decomposed the framing and bets task interactions, but it showed roughly equal predictive validity for the ratio bias task. Compared with the CRT, the Rasch-based measure showed stronger effects with respect to the bets task and the ratio bias test but only showed modest effect size differences for the framing task. It is possible that the use of a more general population, not to mention one collected over the Internet, dampened expected relationships between numeracy and decision effects, thus reducing the chances of finding stronger scale-based differences. For example, the materials in the framing task were originally developed to be meaningful to the undergraduate population tested by Peters et al. (2006), but the course level information (which was provided without further explanation) may have been confusing for the more general population studied here. Second, although Internet data collection is a valid means of obtaining psychological data, data from Internet samples are often noisier because of the lack of environmental control (Gosling, Vazire, Srivastava, & John, 2004).

Although these results are encouraging, they may raise a potential question regarding the advantages of the Rasch-based numeracy scale. As we have demonstrated in the past two studies, the primary advantage of the Rasch-based scale is that it offers a normal distribution in the general population, compared with the Lipkus et al. measure and the CRT, both of which are significantly skewed. Because skewness can attenuate linear associations between variables, we predicted that the Rasch-based scale would be a stronger linear predictor than either of the component scales. Our Study 2 results suggest that this will not always be the case. In Study 3, we examine this issue further within the context of risk perception.

STUDY 3

In this study, we wanted to further explore the comparative predictive validity of the Rasch-based scale using two additional tasks. We turned our attention to understanding how numeracy may predict perceived risks. Recent work has demonstrated that numeracy is related to likelihood and risk perceptions. For instance, when presented with numerical probability information, less numerate participants tend to think that negative low-probability events are more likely to occur, compared with more numerate participants (e.g., Dieckmann et al., 2009; also Lipkus, Peters, Kimmick, Liotcheva, & Marcom, 2010). This typical finding may be due to the less numerate responding more to non-numeric and often emotional information about risks such as cancer (Peters, 2012; Reyna et al., 2009).
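The distributional argument carried into this study, that a skewed score distribution can attenuate linear associations, can be illustrated with a small simulation. The sketch below is not the authors' analysis: it assumes a standard Rasch response model, and every number in it (the sample size, the item difficulties in logits, the noise level of the outcome) is a hypothetical value chosen only for illustration. Three eight-item tests are scored for the same simulated respondents: one far too easy, one far too hard, and one whose difficulties are spread across the ability range.

```python
import math
import random


def rasch_probability(theta, difficulty):
    """Rasch (one-parameter logistic) probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def simulate_sum_scores(abilities, difficulties, rng):
    """Draw one sum score per person: the number of items answered correctly."""
    scores = []
    for theta in abilities:
        correct = sum(
            1 for b in difficulties if rng.random() < rasch_probability(theta, b)
        )
        scores.append(correct)
    return scores


def mean(values):
    return sum(values) / len(values)


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def skewness(values):
    """Moment-based sample skewness (third standardized moment)."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def run_demo(n_people=5000, seed=1):
    """Return {test name: (skewness of sum scores, r with the outcome)}."""
    rng = random.Random(seed)
    abilities = [rng.gauss(0.0, 1.0) for _ in range(n_people)]
    # Hypothetical outcome that depends linearly on latent ability plus noise.
    outcome = [theta + rng.gauss(0.0, 0.5) for theta in abilities]
    # Hypothetical item difficulties (logits); none are estimates from the paper.
    tests = {
        "too_easy": [-3.0] * 8,
        "too_hard": [3.0] * 8,
        "spread": [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0],
    }
    results = {}
    for name, difficulties in tests.items():
        scores = simulate_sum_scores(abilities, difficulties, rng)
        results[name] = (skewness(scores), pearson_r(scores, outcome))
    return results


if __name__ == "__main__":
    for name, (skew, r) in run_demo().items():
        print(f"{name:9s} skewness = {skew:+.2f}   r with outcome = {r:.2f}")
```

All three tests have the same length, so any difference in the correlations comes from how well the item difficulties are targeted to the ability distribution rather than from the number of items. In this setup the mistargeted tests pile scores up at the ceiling or floor and are expected to correlate less strongly with the outcome than the test with spread difficulties.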
For this study, we examined whether the Rasch-based measure would be a stronger linear predictor for outcomes related to the explicit understanding and use of probabilistic estimation than is afforded by either the CRT or the Lipkus et al. 11-item measure. The association between understanding risk information and numeracy appears to be a very robust phenomenon (see Reyna et al., 2009, for a review). Understanding how numeracy is associated with risk perceptions is important in many domains, including financial and health decisions. For instance, if individuals lower in numeracy misinterpret the risks of treatment options, they may act in a suboptimal way. Similarly, being able to accurately identify true numeracy abilities may enable risk communicators to develop more customized and effective communication messages.

Method

Participants

The sample (N = 165) was drawn from the Decision Research Web Panel and was 57.6% women (mean age = 39.53 years). Approximately 25% of the sample had a high school education or less, 4% had some vocational training, 28% had attended some college, 33% were college graduates, and 10% had attended graduate or professional school after college.

Procedure

In a previous session, participants completed the CRT, the Lipkus et al. numeracy measure, and the additional items from the Peters et al. measure. Participants each read two different scenarios that included a narrative discussion of available evidence relating to an event as well as a numerical probability assessment made by an expert. The first scenario described a potential terrorist attack, and the second scenario described the possible extinction of salmon in a Pacific Northwest river. The likelihood of each event was presented as either 5% or 20%. Each participant read both scenarios (their order was counterbalanced across subjects), and the numerical probability attached to each scenario was counterbalanced separately across subjects (i.e., numerical probability was a within-subject manipulation). After reading each scenario, participants reported their own perceptions of the likelihood of the attack or salmon extinction on a scale ranging from 0% to 100%. Because the goal of this analysis was to examine the associations between the different numeracy scales and likelihood perceptions in the two scenarios, we do not report the effect of the within-subject condition but instead focus on the correlational analyses for this study.

Results and discussion

Table 8 shows the correlations between perceived likelihood and the three different numeracy scales for the full sample, and for the lower-education (vocational school or less) and higher-education (some college or more) groups. For both scenarios, the full-sample correlations were higher with the Rasch-based measure, although each of the numeracy scales is significantly negatively correlated with perceived likelihood, as expected. However, we anticipated the primary benefit of the Rasch-based measure to be in identifying linear effects across a range of educational levels. In particular, given the difficulty of the CRT, we expected attenuated correlations in the lower-education group. As predicted, the results demonstrate that the CRT showed the smallest correlations across both scenarios, with the Lipkus et al. and Rasch-based measures showing comparable effect sizes. In the higher-education group, all of the numeracy scales were inversely correlated with risk perceptions; the Rasch-based measure shows the largest effect size for both scenarios.

Table 8. Correlations between risk perceptions and numeracy in the full sample and as a function of educational level: Study 3

                              Full sample   Lower-education group   Higher-education group
Terrorist attacks
  CRT (three items)           −.24**        −.13*                   −.21*
  Lipkus et al. (11 items)    −.34**        −.34*                   −.29**
  Rasch-based (eight items)   −.41**        −.38**                  −.36**
Salmon extinction
  CRT (three items)           −.35**        −.11                    −.33**
  Lipkus et al. (11 items)    −.38**        −.31*                   −.35**
  Rasch-based (eight items)   −.44**        −.27                    −.43**

Note. CRT, cognitive reflection test. p < .10, *p < .05, **p < .01.

As expected, the Rasch-based measure showed the strongest and most consistent effects in the full sample and across the two education groups. The CRT consistently demonstrated low correlations in the lower-education group. Moreover, both the Lipkus et al. measure and the CRT showed lower correlations with risk perceptions than did the Rasch-based measure in the higher-education group.

Study 3 demonstrates some distinct advantages of the new Rasch-based measure. First, the Rasch-based measure demonstrates the most consistent level of correlations across various educational levels. We attribute this advantage to the fact that performance on the Rasch-based measure is normally distributed in the general population. Second, and perhaps more importantly, the Rasch-based measure overall shows stronger predictive validity in these judgments and decisions, compared with the other two measures. Compared with the Rasch-based measure, the CRT showed limited predictive validity, especially in the lower-education sample. In contrast, the Lipkus et al. measure showed evidence of reduced predictive validity in higher-education samples.

GENERAL DISCUSSION

A growing body of research has demonstrated that individual differences in numeracy are associated with how individuals perceive risks, understand charts and graphs, and ultimately make decisions. However, measurement of this construct has varied. To our knowledge, this study is the first to present the psychometric properties of several popular numeracy measures across a diverse sample in terms of age and educational level (although see Liberali et al., 2011, for a similar examination with Brazilian and US college-age samples, which adds to the literature from a cross-cultural perspective). Inspection of the distributional characteristics of these measures demonstrates that the previously used measures are very skewed, which may limit their ability to discriminate an individual’s trait level of numeracy. In general, the CRT appears to be very difficult, whereas the Lipkus et al. (2001) measure appears to be too easy for most individuals, leading to non-normal score distributions, an issue that prior research has largely addressed by using median splits or extreme group designs. We do not mean to either diminish or criticize the contributions that have been made using these scales. In fact, these studies reinforce past research efforts supporting and strengthening the validity of extant measures.

In the current study, we used Rasch analysis to develop a scale that offers researchers an alternative means to assess individual differences in numeracy, compared with classical test theory approaches (Embretson, 1996). The items retained, as well as the relative difficulty scaling of these items, were identical across two large independent samples of individuals ranging from 18 to 89 years of age. Moreover, the Rasch-based numeracy scale retained a wide range of item difficulties. Further, we found that this scale approached a normal distribution in both samples, which we believe will ultimately lead researchers to treat numeracy as a continuous variable rather than as a dichotomous variable. We feel that this is an important contribution, given the potential limitations involved with dichotomizing variables (MacCallum, Zhang, Preacher, & Rucker, 2002).

Cronbach and Meehl’s (1955) classic article first identified construct validity (i.e., how trustworthy is the score and its interpretation) as the most important form of validity in psychological tests. Construct validity of a measure should be treated as a continual process that involves researchers testing the predictive validity of the measure, as well as assessing convergent and discriminant validity. The Rasch-based measure demonstrates predictive validity comparable with that obtained in previous numeracy studies. In fact, when directly comparing the Rasch-based scale with its predecessors, we found that the Rasch-based measure predicted as well as or better than the CRT and the Lipkus et al. measure across two separate studies.

We also found that the Rasch-based numeracy measure was strongly correlated with the SNS of Fagerlin et al. (2007), supporting the convergent validity of the measure. Although the SNS was not intended to be a substitute for assessing precise numeracy abilities, this finding reinforces prior research supporting a link between subjective and objective assessments of numeracy (Fagerlin et al., 2007; Zikmund-Fisher et al., 2007; although see Reyna et al., 2009, p. 955, for an excellent discussion regarding concerns about the accuracy of individuals’ subjective assessments of their own numeracy). Because the SNS was administered after the objective numeracy measures, we cannot rule out the possibility that individuals reflected on the perceived ease/difficulty of the numeracy items, which, in turn, may have inflated the correlation between numeracy and SNS. However, our results are consistent with those reported by Fagerlin et al. (2007), who had subjects complete the SNS first. Finally, our data cannot directly speak to any differences in predictive power between objective and subjective numeracy scales, but we believe that this is an important question that future research should address.

We acknowledge that this scale may not include a complete range of difficulty. Because of our study’s design, our results are limited by the number of items that were included in the initial item pool. In fact, examination of the Rasch-based item difficulties would suggest that more items could be added to more finely differentiate individuals’ numeracy ability. Cokely et al. (2012), for instance, applied a decision tree approach to develop a computer-adaptive test for the highly numerate. Future research using IRT principles can help to create adaptive tests that may assess numeracy across a wider range of ability levels.

Another implication of only using existing measures is that it restricts our ability to conduct a more extensive analysis of potential multidimensionality of the numeracy construct (Liberali et al., 2011). If we had started with a much larger initial item pool, it might be reasonable to expect multiple correlated facets of numeracy to be extracted that would represent sub-competencies of numeracy. Although previous research has typically added items on the basis of their face validity, we recommend that future scale construction efforts be based instead on accepted scale construction guidelines widely used in the assessment literature (e.g., Clark & Watson, 1995). This process begins with the generation of an item pool based on theoretical considerations, such as those discussed in literature reviews and empirical inquiries (see Dehaene, 1997, and Reyna et al., 2009, for influential reviews). Briefly, researchers should develop an over-inclusive item pool of various items and difficulty levels. Numeracy skills range from, but are not limited to, simple mathematical operations (e.g., addition, multiplication) to logic and quantitative reasoning, as well as comprehension of probabilities, proportions, and fractions. From this item pool, researchers would subsequently conduct multiple administrations of the items, refining the measure by removing ambiguous/poorly constructed and misfit items along the way. Scale development in this manner can result in the ability to make more fine-grained distinctions in numeracy across persons and to more extensively identify sub-competencies/facets of numeracy. From there, researchers will be able to better test if certain sub-competencies of numeracy are differentially important to particular types of judgment and decision problems. Understanding the multiple potential facets of numeracy is an important and necessary future research direction that would be most properly examined within the context of the scale construction/factor analytic methods that we have outlined.

However, we offer one important caveat with respect to the assessment of multidimensionality. As a consequence of adequately developing measures that assess numeracy sub-competencies in the manner that we have outlined, this method would add many more items to a numeracy scale. It would especially be the case if one wanted to adequately scale item difficulty and ability levels for each sub-competency. At the expense of being more comprehensive, it would undoubtedly add more time to assessments than even the longest numeracy measure that currently exists. Thus, researchers who may have limited assessment time or resources available (e.g., researchers interested in assessing numeracy in large nationally representative surveys) may opt for a shorter instrument, sacrificing construct fidelity for a broader bandwidth. We stress that it is vital for researchers to have both types of measures in their assessment arsenal; ultimately, though, the use of each is dependent on the inquiry at hand.

We believe that our Rasch-based measure provides a valuable advance in the assessment of numeracy. Our results reinforce that our reduced-item scale measures numeracy in a coherent, unitary manner, across a wide range of ability levels. Of particular interest, we used CFA to directly test whether the CRT and the numeracy items comprised different underlying factors. We did not find this to be the case. At the surface, these results appear to be in contrast with those reported by Liberali et al. (2011), who, across two samples, concluded that items from the scales of Lipkus et al. (2001) and Frederick (2005) produced four to five factors based on exploratory factor analysis. Moreover, in one of their two studies, the CRT and objective numeracy items loaded onto different factors. Because the single-factor un-rotated solutions, a direct measure of the common construct defined by the item pool, were not reported, we cannot directly compare results of the current study with those of Liberali et al. (2011). However, given that reported correlations between the CRT and the Lipkus et al. numeracy measure by Liberali et al. (2011) were indicative of a moderate to large effect size (range = .40–.51; mean r = .45), it seems reasonable that a one-factor solution may also have been observed in confirmatory factor analyses of their data.

Note that the Kaiser rule has the potential to overestimate the number of dimensions to retain (Zwick & Velicer, 1986).

In contrast to exploratory factor analysis as a data reduction tool, the Rasch analysis identifies a hypothetical unidimensional line on which items and persons are scaled on the basis of item difficulty and ability level. In turn, misfit items represent items that do not contribute to better identification of the construct. Hence, the reduced scale requires fewer items to estimate the latent construct with the same range of ability level as the full item pool. In our study, we were able to substantially reduce an item pool from 18 to eight items, creating a measure that is comparable, or even better, in terms of predictive validity and internal consistency with that which would have been obtained by administering either all 18 items or one of the component scales.

As the study of numeracy in the decision-making literature continues to grow, the importance of being able to appropriately discriminate individual differences in numeracy also increases. The current study offers a measure that researchers interested in the associations between numeracy and human decision processes can use to assess individual differences across a wider range of target populations compared with previous measures.

ACKNOWLEDGEMENTS

The authors would like to gratefully acknowledge support from the National Science Foundation, grant numbers SES-0820197 and SES-0517770 to Dr. Peters, SES-0901036 to Dr. Burns, SES-0925008 to Dr. Dieckmann, and SES-082058 to Dr. Weller. Data collection for Study 2 was supported by the National Institute on Aging, grant numbers R01AG20717 and P30AG024962. All views expressed in this paper are those of the authors alone.

REFERENCES

Bateman, I. A., Dent, S., Peters, E., Slovic, P., & Starmer, C. (2007). The affect heuristic and attractiveness of simple gambles. Journal of Behavioral Decision Making, 20, 365–380. DOI: 10.1002/bdm.558

Briggs, S. R., & Cheek, J. M. (1986). The role of factor analysis in the development and evaluation of personality scales. Journal of Personality, 54, 106–148.

Burkell, J. (2004). What are the chances? Evaluating risk and benefit information in consumer health materials. Journal of the Medical Library Association, 92, 200–208.

Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in scale development. Psychological Assessment, 7, 309–319.

Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159. DOI: 10.1037/0033-2909.112.1.155

Cokely, E. T., Galesic, M., Schulz, E., Ghazal, S., & Garcia-Retamero, R. (2012). Measuring risk literacy: The Berlin Numeracy Test. Judgment and Decision Making, 7, 25–47.

Cokely, E. T., & Kelley, C. M. (2009). Cognitive abilities and superior decision making under risk: A protocol analysis and process model evaluation. Judgment and Decision Making, 4, 20–33.

Cole, J. C., Kaufman, A. S., Smith, T. L., & Rabin, A. S. (2004). Development and validation of a Rasch-derived CES-D short form. Psychological Assessment, 16, 360–372.

Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78, 98–104.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.

Dehaene, S. (1997). The number sense: How the mind creates mathematics. New York: Oxford University Press.

Denes-Raj, V., & Epstein, S. (1994). Conflict between intuitive and rational processing: When people behave against their better judgment. Journal of Personality and Social Psychology, 66, 819–829.

Dieckmann, N. F., Slovic, P., & Peters, E. M. (2009). The use of narrative evidence and explicit likelihood by decision makers varying in numeracy. Risk Analysis, 29, 1473–1487. DOI: 10.1111/j.1539-6924.2009.01279

Dunning, D., Heath, C., & Suls, J. M. (2004). Flawed self-assessment: Implications for health, education, and the workplace. Psychological Science in the Public Interest, 5(3), 69–106.

Educational Testing Service. (1992). National Adult Literacy Survey (NALS). Princeton, NJ: ETS. Retrieved from http://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=199909 (14 August 2011).

Embretson, S. E. (1996). The new rules of measurement. Psychological Assessment, 8, 341–349.

Estrada, C., Barnes, V., Collins, C., & Byrd, J. C. (1999). Health literacy and numeracy. Journal of the American Medical Association, 282, 527.

Fagerlin, A., Zikmund-Fisher, B., Ubel, J., Peter, A., Jankovic, A., Derry, H. A., & Smith, D. M. (2007). Measuring numeracy without a math test: Development of the subjective numeracy scale. Medical Decision Making, 27, 672–680. DOI: 10.1177/0272989X07304449

Finucane, M. L., & Gullion, C. M. (2010). Developing a tool for measuring the decision-making competence of older adults. Psychology and Aging, 25, 271–288. DOI: 10.1037/a0019106

Frederick, S. (2005). Cognitive reflection and decision making. Journal of Economic Perspectives, 19, 25–42.

Gosling, S. D., Vazire, S., Srivastava, S., & John, O. P. (2004). Should we trust Web-based studies? A comparative analysis of six preconceptions about Internet questionnaires. American Psychologist, 59, 93–104.

Hibbard, J. H., Mahoney, E. R., Stockard, J., & Tusler, M. (2005). Development and testing of a short form of the patient activation measure. Health Research and Educational Trust, 40, 1918–1930.

Hibbard, J. H., Slovic, P., Peters, E., Finucane, M. L., & Tusler, M. (2001). Is the informed-choice policy approach appropriate for Medicare beneficiaries? Health Affairs, 20, 199–203.

Kahneman, D., & Tversky, A. (1979). Prospect theory: An analysis of decision under risk. Econometrica, 47, 263–291.

Kirsch, I. S., Jungeblut, A., Jenkins, L., & Kolstad, A. (2002). Adult literacy in America: A first look at the findings of the National Adult Literacy Survey (3rd ed., Vol. 201). Washington, DC: National Center for Education, US Department of Education.

Kutner, M., Greenberg, E., Jin, Y., & Paulsen, C. (2006). The health literacy of America’s adults: Results from the 2003 National Assessment of Adult Literacy (NCES 2006-483). Washington, DC: National Center for Education Statistics, US Department of Education.

Liberali, J. M., Reyna, V. F., Furlan, S., Stein, L. M., & Pardo, S. T. (2011). Individual differences in numeracy and cognitive reflection, with implications for biases and fallacies in probability judgment. Journal of Behavioral Decision Making. DOI: 10.1002/bdm.752

Linacre, J. M. (2002). What do infit and outfit, mean-square and standardized mean? Rasch Measurement Transactions, 16, 878.

Lipkus, I. M., Peters, E., Kimmick, G., Liotcheva, V., & Marcom, P. (2010). Breast cancer patients’ treatment expectations after exposure to the decision aid program Adjuvant Online: The influence of numeracy. Medical Decision Making, 30, 464–473.

Lipkus, I. M., Samsa, G., & Rimer, B. K. (2001). General performance on a numeracy scale among highly educated samples. Medical Decision Making, 21, 37–44.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7, 19–40.

National Center for Education Statistics (NCES). (2003). National Assessment of Adult Literacy (NAAL). http://nces.ed.gov/naal/ (15 August 2011).

Obrecht, N. A., Chapman, G. B., & Gelman, R. (2009). An encounter frequency account of how experience affects likelihood estimation. Memory & Cognition, 37, 632–643.

Peters, E. (2012). Beyond comprehension: The role of numeracy in judgments and decisions. Current Directions in Psychological Science, 21, 31–35.

Peters, E., Dieckmann, N. F., Dixon, A., Hibbard, J. H., & Mertz, C. K. (2007). Less is more in presenting quality information to consumers. Medical Care Research and Review, 64, 169–190. DOI: 10.1177/10775587070640020301

Peters, E., Dieckmann, N. F., Västfjäll, D., Mertz, C. K., Slovic, P., & Hibbard, J. H. (2009). Bringing meaning to numbers: The impact of evaluative categories on decisions. Journal of Experimental Psychology: Applied, 15, 213–227. DOI: 10.1037/a0016978

Peters, E., Hibbard, J. H., Slovic, P., & Dieckmann, N. F. (2007). Numeracy skill and the communication, comprehension, and use of risk and benefit information. Health Affairs, 26, 741–748.

Peters, E., Västfjäll, D., Slovic, P., Mertz, C., Mazzocco, K., & Dickert, S. (2006). Numeracy and decision making. Psychological Science, 17, 407–413.

Prieto, L., Alonso, J., & Lamarca, R. (2003). Classical test theory versus Rasch analysis for quality of life questionnaire reduction. Health and Quality of Life Outcomes, 1, 27. DOI: 1186/1477-7525

Rasch, G. (1993). Probabilistic models for some intelligence and attainment tests. Chicago: Mesa Press (original work published in 1960).

Reyna, V. F., Nelson, W., Han, P., & Dieckmann, N. F. (2009). How numeracy influences risk reduction and medical decision making. Psychological Bulletin, 135, 943–973.

Schwartz, L. M., Woloshin, S., Black, W. C., & Welch, H. G. (1997). The role of numeracy in understanding the benefit of screening mammography. Annals of Internal Medicine, 127, 966–972.

Simon, G. E., Ludman, E. J., Bauer, M. S., Unützer, J., & Operskalski, B. (2006). Long-term effectiveness and cost of a systematic care program for bipolar disorder. Archives of General Psychiatry, 63, 500–508.

Slovic, P., Finucane, M., Peters, E., & MacGregor, D. G. (2002). The affect heuristic. In T. Gilovich, D. Griffin, & D. Kahneman (Eds.), Heuristics and biases: The psychology of intuitive judgment (pp. 397–420). New York: Cambridge University Press.

Smith, P., & McCarthy, G. (1996). The development of a semi-structured interview to investigate the attachment-related experiences of adults with learning disabilities. British Journal of Learning Disabilities, 24, 154–160.

Smith, G. T., McCarthy, D. M., & Anderson, K. G. (2000). On the sins of short-form development. Psychological Assessment, 12, 102–111.

Stanovich, K. E., & West, R. F. (2008). On the relative independence of thinking biases and cognitive ability. Journal of Personality and Social Psychology, 94, 672–695.

Thaler, R. H., & Sunstein, C. R. (2003). Libertarian paternalism. American Economic Review, 93, 174–179.

Toplak, M. E., West, R. F., & Stanovich, K. E. (2011). The Cognitive Reflection Test as a predictor of performance on heuristics-and-biases tasks. Memory and Cognition, 39, 1275–1289. DOI: 10.3758/s13421-011-0104-1

Woloshin, S., Schwartz, L. M., & Welch, H. G. (2004). The value of benefit data in direct-to-consumer drug ads. Health Affairs, W4, 234–245. DOI: 10.1377/hlthaff.W1374.1234

Zikmund-Fisher, B. J., Smith, D. M., Ubel, P. A., & Fagerlin, A. (2007). Validation of the Subjective Numeracy Scale (SNS): Effects of low numeracy on comprehension of risk communications and utility elicitations. Medical Decision Making, 27, 663–671. DOI: 10.1177/0272989X07303824

Zwick, W. R., & Velicer, W. F. (1986). Comparison of five rules for determining the number of components to retain. Psychological Bulletin, 99, 432–442.

Authors’ biographies:

William J. Burns is a research scientist at Decision Research (Eugene, OR), whose current work focuses on modeling public response and the subsequent economic impacts of disasters (special emphasis on terrorism) on urban areas.

Ellen Peters is an associate professor in the Psychology Department at The Ohio State University.

Joshua A. Weller is currently a research scientist at Decision Research (Eugene, OR). His research focuses on how the ability to make advantageous decisions develops throughout the lifespan. Additionally, Dr. Weller is interested in understanding
She studies decision making how individual differences relate to risk taking and decision as an interaction of characteristics of the decision situation and char- making. acteristics of the individual. Her research interests include decision Nathan F. Dieckmann is a research scientist at Decision Research making, affective and deliberative information processing, emotion, (Eugene, OR). He conducts basic and applied research in decision risk perception, numeracy, and aging. making, risk communication, and statistical methodology. Martin Tusler is a research specialist in the Psychology Depart- Authors’ addresses: ment at The Ohio State University. He studies medical decision making, scale construction, and numeracy. Joshua A. Weller, Nathan F. Dieckmann, C. K. Mertz, and William J. Burns, Decision Research, Eugene, OR, USA. C. K. Mertz is a data analyst at Decision Research (Eugene, OR). Her research interests include multivariate statistical methods, risk Martin Tusler and Ellen Peters, Department of Psychology, The perception, and affect. Ohio State University, Columbus, OH, USA. Copyright © 2012 John Wiley & Sons, Ltd. J. Behav. Dec. Making, 26: 198–212 (2013) DOI: 10.1002/bdm

Journal

Journal of Behavioral Decision Making, Pubmed Central

Published: Mar 15, 2012
