Behav Res (2018) 50:1198–1216
DOI 10.3758/s13428-017-0938-y

Statistical and methodological problems with concreteness and other semantic variables: A list memory experiment case study

Lewis Pollock

Published online: 13 July 2017
© The Author(s) 2017. This article is an open access publication

Abstract The purpose of this article is to highlight problems with a range of semantic psycholinguistic variables (concreteness, imageability, individual modality norms, and emotional valence) and to provide a way of avoiding these problems. Focusing on concreteness, I show that for a large class of words in the Brysbaert, Warriner, and Kuperman (Behavior Research Methods 46: 904–911, 2013) concreteness norms, the mean concreteness values do not reflect the judgments that actual participants made. This problem applies to nearly every word in the middle of the concreteness scale. Using list memory experiments as a case study, I show that many of the “abstract” stimuli in concreteness experiments are not unequivocally abstract. Instead, they are simply those words about which participants tend to disagree. I report three replications of list memory experiments in which the contrast between concrete and abstract stimuli was maximized, so that the mean concreteness values were accurate reflections of participants’ judgments. The first two experiments did not produce a concreteness effect. After I introduced an additional control, the third experiment did produce a concreteness effect. The article closes with a discussion of the implications of these results, as well as a consideration of variables other than concreteness. The sensorimotor experience variables (imageability and individual modality norms) show the same distribution as concreteness. The distribution of emotional valence scores is healthier, but variability in ratings takes on a special significance for this measure because of how the scale is constructed. I recommend that researchers using these variables keep the standard deviations of the ratings of their stimuli as low as possible.

Keywords Concreteness · Semantic variables · List memory · Methodology

* Lewis Pollock, firstname.lastname@example.org
University College London, London, UK

Word concreteness has become one of the most studied variables in the psycholinguistic literature. Since Paivio, Yuille, and Madigan (1968) published one of the first large-scale databases of word concreteness norms, “concreteness effects” have emerged in a variety of investigations of various cognitive processes, and a range of theories have been proposed in an attempt to explain these effects. Independent teams of researchers operating over a period of decades have repeatedly shown that concrete words show a processing advantage over abstract words in certain experimental paradigms. For example, concrete words are easier to remember than abstract words (Allen & Hulme, 2006; Miller & Roodenrys, 2009; Romani, McAlpine, & Martin, 2008; Walker & Hulme, 1999), are easier to make associations with (de Groot, 1989), and are more easily and more thoroughly defined in dictionary definition tasks (Sadoski, Kealy, Goetz, & Paivio, 1997). Historically, it was claimed that concrete words are responded to more quickly than abstract words in lexical decision tasks (Bleasdale, 1987; James, 1975; Kroll & Merves, 1985), although more recent experiments have shown no difference (Brysbaert, Stevens, Mandera, & Keuleers, 2016), or even that abstract words might have an advantage after various other variables have been accounted for (Kousta, Vigliocco, Vinson, Andrews, & Del Campo, 2011). However, even an abstractness advantage in lexical decision points to the utility of word concreteness as a psycholinguistic variable.

Brain-imaging techniques have also been employed to determine whether the neural systems underpinning concrete words and abstract words are distinct (Binder, Westbury, McKiernan, Possing, & Medler, 2005; Dhond, Witzel, Dale, & Halgren, 2007; Kounios & Holcomb, 1994; Pexman, Hargreaves, Edwards, Henry, & Goodyear, 2007; Sabsevitz, Medler, Seidenberg, & Binder, 2005). The general consensus from these brain-imaging studies is that there is evidence of a neuroanatomical difference in the processing of concrete versus abstract words.

Psychologists are clearly heavily invested in the investigation of word concreteness, and for good reasons. If there are properties that define a cognitively relevant ontology of concepts, concreteness seems like a good candidate: Something about what constitutes the concept of “elephants” (highly concrete) is probably different from what constitutes the concept of “paradoxes” (highly abstract). However, in this article I will highlight a problem with the concreteness measure, based on a simple statistical summary of the Brysbaert, Warriner, and Kuperman (2013) concreteness norms. I report three replication experiments that together suggest that this problem is not fatal to concreteness research, but also that it should be acknowledged when researchers design their stimuli. I also show that the same problem applies to other variables in semantic databases, such as imageability (Cortese & Fugett, 2004; Schock, Cortese, & Khanna, 2012) and individual modality norms (Lynott & Connell, 2012).

Word concreteness

A word’s concreteness rating is derived by asking a group of participants to rate that word for concreteness on a Likert scale. A low score indicates that a word is highly “abstract,” whereas a high rating indicates that a word is highly “concrete.” The mean value of all participants’ ratings is taken to be an approximation of a word’s position on an abstract–concrete continuum. I will now develop some theoretical concerns about the validity of traditional concreteness norms before turning to a statistical analysis of the Brysbaert et al. (2013) database. Consider the job a participant is being asked to do when she is told to rate a word between, say, 1 and 5 on a scale of concreteness. She is told that “concrete words are experienced by the senses,” whereas abstract words are not (Paivio et al., 1968). For some words, the interpretation of traditional concreteness norming instructions is relatively straightforward. A participant who is presented with the word “apple” is likely to have seen, touched, smelled, and tasted apples throughout the course of their life, and will unproblematically assign “apple” a high concreteness rating. Similarly, a participant who is presented with the word “serendipity” is likely to reason that since serendipity is a loose association between some coincidental, nonspecified events, and is not something that affords direct sensory experience, the word “serendipity” should be assigned a low concreteness rating. However, what are the properties that a word/concept should have in order for it to be assigned a mid-scale rating? It is difficult to formulate a coherent approach to this task: Can an entity or idea be “half-seen” or “half-touched”? What does it mean to have intermediate sensory experience of an entity or idea? That is to ask: What is a participant telling us about a word when they rate it a 3 out of 5? They could mean any one of the following:

1. Adding up all of my sensory experience of this object across all five of the sensory modalities, I realize that I have seen and heard it, but never touched, smelled, or tasted it. So I suppose I’ll rate it a 3.
2. One interpretation of this word brings to mind something that cannot be directly experienced, whereas a different interpretation of this word brings to mind something that can be directly experienced. So I suppose I’ll rate it a 3.
3. Sometimes I associate sensory experience with this word, but sometimes I don’t. So I suppose I’ll rate it a 3.

It is certainly possible to imagine more potential approaches, and there is no empirical basis for selecting one of these approaches over another. Furthermore, it is likely that different participants will generate different interpretations for many of the words in any list of words to be normed. When a participant sees the letter string < deed > presented in isolation, there is no way that a researcher can control for the fact that half of the participants may interpret < deed > as referring to a document associated with proof of property ownership (high concreteness value?), and the other half may interpret it as referring to some unspecified action, perhaps involving some element of heroism (low concreteness value?). Consequently, for a number of words it is just not clear what word/concept the mean concreteness rating is supposed to reflect.

This point on its own might be enough to motivate the avoidance of words with a mean value in the middle of a concreteness–abstractness scale. Given that it is not clear what it is that participants are even telling us when they rate a word a 3, we might also wonder how often participants actually use values from the middle of the concreteness scale when making their judgments. Recently, Brysbaert et al. (2013) provided a concreteness norm database of 40,000 English words, which dwarfs the previously popular MRC database used in most studies (Coltheart, 1981). This new, larger database allows a statistical analysis of the distributions of concreteness norms across a much larger section of the English lexicon. I now present this analysis and use it to develop the concerns raised in this section.

Brysbaert et al. (2013) concreteness norms

Brysbaert et al. (2013) collected a new set of concreteness norms for 40,000 English words. Groups of approximately 25 participants rated subsets of the whole list of 40,000 words on a concreteness scale of 1 (very abstract) to 5 (very concrete). The participants (n = 4,237) came from a range of ages, with approximately one third between 17 and 25 years old, and two thirds between 26 and 65. The mean value of a group of participants’ judgments about the concreteness of a stimulus word was assumed to be a useful approximation of that word’s position on a hypothesized concrete–abstract continuum. I shall now argue that this is not necessarily the case.

The standard deviation of a dataset is a measure of the average distance between all data points in that dataset and the mean value of all data points in the dataset. If every participant rates a word as a 1 (highly abstract), then that word’s concreteness rating will have a standard deviation of 0. However, if half of the participants rated a word as a 1, but the other half rated the word as a 5 (highly concrete), that word would have a mean concreteness rating of 3 but a standard deviation of 2. In Likert scale norming tasks, the standard deviation of a set of ratings is therefore a blunt index of the extent to which participants agreed with each other about how a word should be rated.

If a dataset contains 25 numbers (in our case, 25 individual concreteness judgments), all of which are integers between 1 and 5, then there are a finite number of possible combinations of means and standard deviations for that dataset. Figure 1 below plots all of these possible combinations:

Fig. 1 Theoretically possible locations for words rated between 1 and 5 by 25 different participants

Note how, at the extreme ends of the x-axis, only a standard deviation of 0 is possible, because for a mean value to be 1 or 5, all 25 participants must have rated a word as 1 or 5, respectively. However, in the middle of the scale the disagreement that is theoretically possible increases, reaching a peak at mean value ~3, standard deviation ~2. Crucially, it is still theoretically possible for a data point to occur with a mean value located in the middle of the scale, but with a relatively low standard deviation. That is, it is still clearly theoretically possible for participants to more or less consistently agree that a word is of intermediate concreteness.

Now, consider Fig. 2, which plots the actual mean concreteness value and the standard deviation of every noun in the Brysbaert et al. (2013) concreteness norm dataset (n = 14,592) over the top of the theoretically possible combinations depicted in Fig. 1.

Fig. 2 Theoretical versus actual locations

The pattern is striking. At the extreme concrete end of the scale, many items have high concreteness ratings and relatively low standard deviations, indicating that participants more or less agreed in their judgments about how to rate these words. At the extreme abstract end of the scale, there are likewise words with low concreteness ratings and relatively low standard deviations, although not to the same extent as at the extreme concrete end. However, in the middle of the scale there is an obvious rise in the standard deviation. Only a handful of words have a mean value near 3 and a standard deviation even slightly below 1. Indeed, a large class of words have a standard deviation well over 1, ranging from mean values of 1.5 to 4.5.

This indicates that for a great number of items, participants were not agreeing in their judgments of how concrete a stimulus word was. At mean values of 2 and 4 there are many cases of standard deviations above 1. Remember that ratings on this scale can only take integer values between 1 and 5. This means that for many of the words with a mean value of 2 or 4, some participants must have judged these words as belonging at the opposite end of the concreteness scale from the position where the mean value suggests the word belongs. This phenomenon is problematic for the assumption that concreteness should be treated as a continuous variable. This is because in a vast number of cases, participants’ judgments tended not to be continuous; instead, they tended to be binary: Participants were using values of 1, 2, 4, and 5 in producing these concreteness norms, and avoided using 3. Furthermore, in many cases participants were judging a word as a 1 (totally abstract), whereas others were judging that same word as a 4 (somewhat concrete).

Given these methodological issues, it might seem surprising that concreteness effects are so widely reported. If measurements for a large section of the hypothesized concreteness spectrum are actually procedural artifacts, it is then unclear what phenomenon it is that concreteness effects are actually indexing. One potential explanation is that generally, when investigating the effect of a variable, researchers try to choose stimuli that maximize a change in this variable, in order to generate the maximum possible effect. It is therefore possible that empirical concreteness research might not suffer too badly from the problem of binary disagreements concerning midscale items, because researchers will have aimed to pick stimuli from the extreme ends of the scale, and these polar items are less subject to disagreement.

However, if it turns out that many experimental stimuli do suffer from the disagreement phenomenon, this poses an explanatory problem concerning the evidence in favor of processing differences between abstract and concrete items. The typical finding is that there are processing advantages for concrete items relative to abstract items, and the typical explanation of this finding is that concrete and abstract items have different neurologically instantiated formats and/or structural relationships. If a significant number of the stimuli included in an abstract or concrete experimental condition actually come from the middle of the concreteness scale, then the typical claim that there are processing differences between concrete and abstract items is no longer supported by the data. This is because words from the middle of the scale must have high standard deviations. This means that only half of the participants who produced the concreteness measure for that word judged it to be abstract, and the other half judged it to be concrete. Therefore, there are no empirical grounds for calling these words “concrete” or “abstract” in the first place.

Stimuli in concreteness experiments: A case study of list memory paradigms

In this section I plot the stimuli featured in four list memory experimental studies against the entire Brysbaert et al. (2013) database. These studies are Allen and Hulme (2006), Walker and Hulme (1999), Romani et al. (2008), and Miller and Roodenrys (2009). We should note a few things. First, although the replication experiments that I report below feature noun stimuli, and most studies under discussion here also featured nouns, occasionally their stimulus sets featured other word classes alongside nouns. In the case of Allen and Hulme, many of the stimuli in the abstract condition were not nominal. Therefore, to display the maximum number of stimuli for all experiments, I have plotted the entire Brysbaert et al. (2013) database (n = 40,000) instead of just the nominal subsection of it. Not all of the stimuli featured in all experiments appeared in the Brysbaert et al. norms, and these stimuli have been omitted from the analysis. Second, the pattern of means and standard deviations is absolutely unchanged when we compare the entire Brysbaert et al. database with the noun subsection of it.

Fig. 3 Romani et al. (2008) stimuli

Now, consider Fig. 3. The stimuli featured in Romani et al. (2008) best exemplify the problem, although the intention here is not to single out Romani et al. or any of the other authors under discussion for criticism. The analysis I present here would have been almost impossible to carry out at the time that these experiments were conducted, given that the Brysbaert et al. concreteness database was only published in 2013. In brief, the problem is that the concrete words tend to have low standard deviations, whereas the abstract stimuli tend to have high standard deviations and to be drawn from the middle of the scale, rather than the unequivocally abstract part of the scale. This is potentially problematic for the validity of Romani et al.’s conclusions regarding concreteness effects, because many of the stimuli that made up their abstract stimuli were not unequivocally abstract. For the standard deviations of many of the “abstract” stimuli to be as high as they are (in many cases, well above 1), many participants must have been judging those words to be concrete during the Brysbaert et al. (2013) norming process. Some of the abstract stimuli have standard deviations approaching the theoretical maximum of 2, indicating maximum disagreement among participants about whether that word is concrete or abstract. To reiterate: Participants could only apply integer values in making their judgments. Therefore, even if a word has a mean concreteness rating of approximately 2, but also a standard deviation of the rating above 1, that means that some participants must have been crossing scale halves in making their judgments. Ultimately, it is not clear what comparison is actually being made here.
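
The integer-rating argument used in the preceding sections can be checked by brute-force enumeration. The sketch below is not the author's analysis code; it is an illustration that assumes exactly 25 raters per word and the sample standard deviation (n − 1 denominator), neither of which the text specifies exactly. The function names are hypothetical. It lists every way 25 ratings can be spread over the values 1 to 5, which yields the set of reachable (mean, SD) points of the kind plotted in Fig. 1, and it tests whether a mean of exactly 2 combined with an SD above 1 forces at least one rating of 4 or 5.

```python
import math

SCALE = (1, 2, 3, 4, 5)   # Likert points used in the norms
N_RATERS = 25             # assumption: exactly 25 raters per word


def count_vectors(total, slots):
    """Yield every tuple of `slots` non-negative integers summing to `total`."""
    if slots == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in count_vectors(total - first, slots - 1):
            yield (first,) + rest


def mean_and_sd(counts):
    """Mean and sample SD (assumed n - 1 denominator) from per-point counts."""
    n = sum(counts)
    mean = sum(c * r for c, r in zip(counts, SCALE)) / n
    squares = sum(c * (r - mean) ** 2 for c, r in zip(counts, SCALE))
    return mean, math.sqrt(squares / (n - 1))


def feasible_points(n_raters=N_RATERS):
    """All (mean, SD) pairs reachable by n_raters integer ratings on SCALE."""
    return {mean_and_sd(c) for c in count_vectors(n_raters, len(SCALE))}


def crossing_is_forced(target_sum, sd_threshold=1.0, n_raters=N_RATERS):
    """True if every rating set with this sum and an SD above the threshold
    contains at least one rating of 4 or 5."""
    for counts in count_vectors(n_raters, len(SCALE)):
        total = sum(c * r for c, r in zip(counts, SCALE))
        if total != target_sum:
            continue
        _, sd = mean_and_sd(counts)
        if sd > sd_threshold and counts[3] + counts[4] == 0:
            return False  # high SD reached using only ratings of 1, 2, 3
    return True


if __name__ == "__main__":
    points = feasible_points()
    print("largest reachable SD:", max(sd for _, sd in points))
    # A mean of exactly 2 corresponds to a rating sum of 50 for 25 raters.
    print("mean 2, SD > 1 forces a 4 or 5:", crossing_is_forced(50))
```

Under these assumptions the largest reachable SD is √4.16 ≈ 2.04 (a 12/13 split between ratings of 1 and 5), in line with the peak near a mean of 3 and an SD of 2 described for Fig. 1. For means that are only approximately 2, the bound is slightly looser under the sample-SD convention: 12 ratings of 1 and 13 ratings of 3 give a mean of 2.04 and an SD of about 1.02 without any rating above 3, so the argument is safest for SDs clearly above 1.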
The concrete stimulus lists were more or less unproblematically concrete. However, the abstract stimulus lists contained words drawn from nearly the entire length of the concreteness scale, and also tended to feature words that participants disagreed about how to rate.

Figure 4 depicts the abstract and concrete stimuli featured in Allen and Hulme (2006). Again, many “abstract” stimuli here have standard deviations well above 1, indicating that people disagreed about whether the words were abstract in the first place. The range of mean ratings of concreteness for the abstract condition is also clearly much higher than in the concrete condition. Once again, a relatively homogeneous group of concrete words has been compared to a heterogeneous group of words about which participants tended to disagree.

Fig. 4 Allen and Hulme (2006) stimuli

Figure 5 plots the stimuli featured in Miller and Roodenrys (2009). Again, there is a marked difference in standard deviations between the concrete and the abstract stimuli. Furthermore, the standard deviations of the abstract stimuli are so high (well above 1 in the majority of cases) that the mean value does not reflect the judgments that participants were actually making.

Fig. 5 Miller and Roodenrys (2009) stimuli

Finally, consider Fig. 6, which depicts the stimuli featured in Walker and Hulme (1999). The midscale criticism applies least to this set of stimuli, although it is still clearly the case that the concrete stimuli tended to have lower standard deviations than the abstract stimuli. The reasons for this have already been expounded. The upshot is that a skeptic could reasonably argue that these experiments do not actually provide evidence for concreteness effects. The reason is that the comparison being made was meant to be between concrete and abstract items, but the comparison that was actually made was between concrete items, on the one hand, and a group of stimuli about which participants disagree, on the other. It could be the case that words that engender disagreement are those that are hard to remember, and that this explains processing differences that have previously been attributed to concreteness/abstractness. The experiments that I report below were designed to test this possibility.

Fig. 6 Walker and Hulme (1999) stimuli

Before moving on to a report of these replication attempts, I wish to point out that list memory paradigms are not a special case when it comes to the properties of “abstract” stimuli. Table 1 presents a number of experimental concreteness studies from a wide variety of paradigms, as well as a summary of the concreteness values and standard deviations of the stimuli featured in their experiments. The abstract–midscale stimulus pattern applies to every single experiment.

Table 1 Concreteness statistics in various experimental paradigms

                                                                                         Concrete                      Abstract
Article                                   Type of Data         Experimental Paradigm     Mean Concreteness   Mean SD   Mean Concreteness   Mean SD
Kroll & Merves (1985)                     Behavioral           Lexical decision          4.55                0.74      2.17                1.22
de Groot (1989)                           Behavioral           Word association          4.66                0.6       2.36                1.24
Paivio et al. (1994)                      Behavioral           Recall                    4.83                0.47      2.29                1.28
Gee et al. (1999)                         Behavioral           Recall                    4.73                0.57      3                   1.33
Binder, Nelson, & Krawczyk (2005)         fMRI                 Lexical decision          4.76                0.52      2.34                1.23
Crutch & Warrington (2005)                Patient population   Word matching             4.83                0.46      3.53                1.18
Sabsevitz et al. (2005)                   fMRI                 Semantic judgment         4.86                0.45      2.58                1.31
ter Doest & Semin (2005)                  Behavioral           Recall                    4.72                0.57      2.45                1.26
Lee & Federmeier (2008)                   EEG                  Semantic judgment         4.41                0.88      2.27                1.24
Huang et al. (2010)                       EEG                  Semantic judgment         3.82                1.17      2.53                1.21
Skipper-Kallal, Mirman, & Olson (2015)    fMRI                 Deep thought              4.44                0.81      2.38                1.22
Jager & Cleland (2016)                    Behavioral           Lexical decision          4.62                0.64      3.29                1.19

Once again, I stress that none of the analysis presented here is intended as a specific criticism of any of these studies. These studies were chosen simply because they reflect a range of experimental paradigms (lexical decision, recall, semantic judgment, word association, and picture–word matching) and data types (behavioral, fMRI, electroencephalography [EEG]), and include both neurotypical and patient populations. They also, laudably, included their stimulus sets in their experimental reports, although it is important to note that for Sabsevitz et al. (2005) and Lee and Federmeier (2008) only samples of the stimuli were available. For every study but one listed in Table 1, the mean standard deviation of the stimuli in the concrete condition was below 1, whereas the mean standard deviation of the stimuli in the abstract condition was above 1. The only exception is Huang, Lee, and Federmeier (2010), in which the standard deviations for both stimulus sets were relatively high. Looking at the distributions displayed above in Figs. 2, 3, 4, 5 and 6, it is clear that the only way these statistics could be obtained is if the midscale disagreement problem applied to all of the abstract stimulus sets of the experiments depicted in the table.

I now turn to a report of three new list memory replication experiments in which I attempted to control for the problems that I have outlined so far.

Experiment 1

The purpose of this experiment was to replicate an experiment reported in Romani et al. (2008) while controlling for the potentially problematic confound between the mean value of a concreteness rating and the standard deviation of that rating. Romani et al. presented participants with lists of words and asked them to recall words from that list immediately after the presentation of the last word of the list. Romani et al. reported that participants were better at recalling lists of words that consisted entirely of concrete words than at recalling lists that consisted entirely of abstract words. Experiment 1 here investigated the reliability of this concreteness effect when the standard deviations of the concreteness values of the words across lists were controlled, while also directly manipulating words’ standard deviation in order to ascertain whether the standard deviation itself has a significant effect on task performance. Figure 7 plots the mean concreteness values and standard deviations of concreteness of the concrete and abstract stimuli used in the present experiment in the same way that the stimuli used in previous experiments were plotted in the previous section.

Fig. 7 Concrete and abstract stimuli featured in Experiment 1

We can see that the contrast in concreteness between conditions is maximized and that the difference in the standard deviations of concreteness ratings is controlled. Of interest is whether the concreteness effect would still occur when these new controls were enforced.

The specific Romani et al. (2008) experiment replicated here is Experiment 3B, which is a free-recall task in which participants simply try to recall any word from the list that they can, regardless of order. Romani et al. reported that concreteness effects are stronger in free-recall than in serial-recall tasks, so a free-recall task provides the most robust test of the concreteness effect. An additional two experimental conditions were added: agreement and disagreement conditions. Words in the agreement condition were taken from the middle of the scale and had relatively low standard deviations, and words in the disagreement condition were taken from the middle of the scale and had relatively high standard deviations. Summary psycholinguistic statistics for all conditions are given in the Materials section below. Three comparisons were of interest: concrete versus abstract, concrete versus disagreement, and concrete versus agreement. In this way, the importance of the midscale problem outlined in the section above can be assessed.

Method

Participants Originally, 60 native speakers of English with no reported neurological disorders were recruited from the University College London SONA psychology pool. Of these, 50 completed the experiment (the other ten either did not turn up or canceled their session). All participants were either awarded course credit or paid £6 for their time.

Materials Forty lists, each containing eight words, were generated. There were four experimental conditions, each of which comprised ten lists. The stimuli were controlled for the following psycholinguistic variables: standard deviation of concreteness, frequency, age of acquisition, number of phonemes, number of letters, and number of syllables. Table 2 contains the mean values (with standard deviations in parentheses) of each of these variables for each condition. Psycholinguistic variable information was gathered from Brysbaert et al. (2013), Kuperman, Stadthagen-Gonzalez, and Brysbaert (2012), and the English Lexicon Project (Balota et al., 2007). The stimulus sets were created using MATCH (van Casteren & Davis, 2007). The four conditions were concrete, abstract, agreement, and disagreement. Concrete lists contained words that had mean values between 4 and 5 on the Brysbaert et al. (2013) concreteness scale. Abstract lists contained words that had mean values between 1 and 2 on the Brysbaert et al. (2013) concreteness scale. The agreement and disagreement lists contained words that had mean values between 2.5 and 3.5 on the Brysbaert et al. (2013) concreteness scale. The concrete, abstract, and agreement lists were constructed such that the standard deviations of the concreteness ratings of the words in those lists were similar, whereas the disagreement condition was formed exclusively of stimuli with high standard deviations. Table 3 contains a sample list from each condition, and full lists of the stimuli featured in all experiments reported in this study are included in the Appendix.

Procedure The experimenter read all of the words from a list one after the other. There was a 2-s pause between consecutive words being read out. The order of the lists and the order of the words within each list were randomized for each participant. After the experimenter had finished reading out a list, the participant spoke out loud any and all words that he or she could remember from that list. The experimenter recorded every word that the participant spoke. Because this was a free-recall task, the order in which the participants recalled the words did not matter. Participants were not penalized for making errors or substitutions, or for saying a word that had not actually been in the list. The experiment lasted approximately 35 min.

Results

Table 4 summarizes the mean numbers of words remembered (and standard deviations) by condition.
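
The dependent measure summarized in Table 4 is the number of list words recalled out of eight, with order ignored and no penalty for words that were not on the list. A minimal sketch of that scoring rule follows. It is illustrative only and is not the scoring script used in the study: the function names are hypothetical, and case-insensitive matching and counting a repeated response only once are assumptions that the Procedure does not state.

```python
def score_free_recall(presented, recalled):
    """Count the presented words that appear among the responses.

    Order is ignored and intrusions (responses not on the list) are simply
    not counted, following the Procedure. Case-insensitive matching and
    counting a repeated response once are assumptions of this sketch.
    """
    targets = {word.strip().lower() for word in presented}
    responses = {word.strip().lower() for word in recalled}
    return len(targets & responses)


def percent_recalled(n_recalled, list_length=8):
    """Express a recall count as a percentage of the list length."""
    return 100.0 * n_recalled / list_length


if __name__ == "__main__":
    # Sample concrete list from Table 3; the responses are invented.
    concrete_list = ["Beaker", "Clinic", "Tango", "Clothing",
                     "Amber", "Jackal", "Roulette", "Survey"]
    responses = ["tango", "amber", "tiger", "beaker", "amber"]
    n = score_free_recall(concrete_list, responses)
    print(n, percent_recalled(n))
```

The same conversion links the two columns of Table 4: a mean of 4.67 words out of eight corresponds to roughly 58.4%.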
Table 2 Stimulus properties

Condition   Mean Concreteness   SD Concreteness   AoA            Zipf Frequency   L Phon        Length        N Syll
Concrete    4.38 (0.17)         1.02 (0.11)       10.45 (2.05)   3.34 (0.79)      5.59 (0.94)   6.93 (1.06)   2.00
Abstract    1.78 (0.14)         1.04 (0.12)       10.58 (2.09)   3.38 (0.83)      5.56 (0.93)   6.84 (1.17)   2.00
Agree       3.17 (0.7)          1.08 (0.07)       10.09 (1.9)    3.15 (0.85)      5.63 (1.03)   6.93 (1.21)   2.00
Disagree    3.1 (0.36)          1.65 (0.05)       10.23 (2.04)   3.13 (0.81)      5.76 (1.10)   6.9 (1.32)    2.00

Mean concreteness: Mean concreteness rating; SD concreteness: The mean standard deviation of the concreteness ratings; AoA: Age of acquisition; Zipf frequency: Word frequency in Zipf units; L Phon: Length of word in phonemes; Length: Length of word in letters; N Syll: Number of syllables

Table 3 Sample stimulus lists

Condition   Word 1       Word 2       Word 3      Word 4     Word 5      Word 6      Word 7     Word 8
Concrete    Beaker       Clinic       Tango       Clothing   Amber       Jackal      Roulette   Survey
Abstract    Desire       Mystique     Intent      Vantage    Glory       Nuance      Unease     Motive
Agree       Diesel       Roughhouse   Attempt     Whiner     Viewpoint   Freshness   Stampede   Leader
Disagree    Slipstream   Audit        Poorhouse   Minute     Rival       Tribune     Abyss      Spectrum

Table 4 Mean words recalled by condition for Experiment 1

Condition   Mean Words Recalled (SD)   Mean Percentage Recalled
Concrete    4.67 (1.35)                58.4%
Abstract    4.48 (1.24)                56%
Disagree    4.38 (1.28)                54.6%
Agree       4.45 (1.35)                55.6%

The results were analyzed with a mixed-effects model in R using the lme4 package (Bates, Mächler, Bolker, & Walker, 2015). The lmerTest package was used in order to obtain p-values for the comparisons of interest via Satterthwaite approximation (Kuznetsova, Brockhoff, & Christensen, 2015). The mixed-effects model examined the fixed effect of experimental condition on the number of words remembered per trial, with subjects and items being treated as random effects with varying intercepts.

The statistical contrasts were the abstract, disagreement, and agreement conditions versus the concrete condition. That is, a treatment contrast was used, with the concrete condition as the baseline. Table 5 displays the results of this analysis.

Table 5 Summary of mixed-effects model for Experiment 1

Fixed Effects
Effect     Estimate   Error   df      t       p     Lower 95% CI for Effect   Higher 95% CI for Effect
Abstract   –.19       –.12    39.25   –1.56   .13   –.43                      .05
Agree      –.22       –.12    39.25   –1.79   .08   –.46                      .03
Disagree   –.29       –.12    39.25   –2.42   .02   –.54                      –.05

Because three nonindependent hypothesis tests were run on the same data, a Bonferroni correction was applied. Assuming a conventional alpha level of .05, the corrected alpha level was therefore .05/3 = .017. The concrete–abstract contrast was not statistically significant (p = .13). Therefore, there was no evidence for an advantage for concrete over abstract word lists, contrary to the findings of Romani et al. (2008), Walker and Hulme (1999), Allen and Hulme (2006), and Miller and Roodenrys (2009). None of the other contrasts were statistically significant, either, at the Bonferroni-corrected alpha level (concrete vs. agreement, p = .08; concrete vs. disagreement, p = .02). There was therefore no evidence that words from the middle of the concreteness scale are simply harder to remember than words from the extreme concrete end of the scale, and there was no evidence that words with high standard deviations in rating are harder to remember than words from the extreme concrete end of the scale. However, a reviewer raised the important point that Experiment 1 suffered from a lack of power, because there were only ten items per condition. This could be the reason that no statistically significant results were obtained.

To account for this possibility, the data were reanalyzed using a Bayesian model comparison analysis in the BayesFactor package for R (Morey, Rouder, & Jamil, 2015) with the default settings and priors. If the results of the frequentist analysis presented in the preceding paragraphs were due to low power, then the Bayes factors produced by this analysis are likely to be between 1/3 and 3, which would indicate that the data do not decide the issue either way.

Kruschke (2011, p. 310) argued that the Bayes factor generated from a model comparison analysis of an experimental design with multiple conditions may be misleading for various reasons. Therefore, the total results dataset of Experiment 1 was partitioned into three smaller datasets that reflected the pairwise comparisons of interest between the conditions: one concrete–abstract comparison, one concrete–agree comparison, and one concrete–disagree comparison. In every case, a model including a parameter for the fixed effect of condition was compared to a null model that featured only subjects and items as random effects. The resulting Bayes factors for each comparison were concrete versus abstract, 0.32; concrete versus agree, 0.38; concrete versus disagree, 0.66. For the concrete–abstract comparison, there is marginal evidence in favor of a null effect (BF = 0.32). For the other two comparisons, the Bayes factor indicates that the data do not decide between the null and alternative models. Taken together with the frequentist analysis presented previously (all p values above the threshold for statistical significance), these results suggest no difference in recall between the concrete and abstract conditions. However, the evidence for a null difference in the other comparisons is inconclusive.

Before moving on to the second replication experiment, it is important to note a shortcoming of Experiment 1 that may have affected the results. The standard deviations of the concreteness ratings of both the concrete and abstract stimuli were relatively high: above 1, in many cases. It could be that, given the concerns raised in previous sections, neither condition provided an accurate sample from the truly concrete or abstract sections of the scale. In the second experiment that I will report, the standard deviations of the conditions were more tightly constrained so that in the concrete and abstract conditions, all standard deviations were below 1.

Experiment 2

Paivio, Walsh, and Bons (1994) presented participants with lists consisting of both concrete and abstract word pairs and reported that concrete word pairs were recalled better than abstract word pairs. This effect has been obtained in many paired-associate learning experiments (Begg, 1972; Nelson & Schreiber, 1992; Paivio, Khan, & Begg, 2000; Paivio et al., 1994). Paivio et al. employed a range of different manipulations across two experiments. In this replication I focused on the simplest version of this paradigm, which is a free-recall task, in order to make the results maximally comparable to those of Experiment 1 above.

condition, and therefore each condition included 16 words, for a total of 24 critical item pairs overall.

Procedure Participants undertook the experiment online via a Qualtrics survey distributed over the Prolific Academic service. Participants were presented with pairs of words, one after the other. Following Marschark and Hunt (1989) and Paivio et al. (1994), each pair of words was presented on the participant’s computer screen for 8 s. Eight pairs were presented in each of the three conditions, and all pairs were presented in a randomized nonblocked order for each participant. The order-
The aim of the present ing of the words in each pair from left to right on the computer experiment was to test whether a concreteness effect still oc- screen was not randomized. At the beginning and end of the curs if the contrast between concrete and abstract stimuli is list, three pairs of filler items were included in order to soak up maximized and the standard deviations of their concreteness primacy and recency effects. Participants also received a short scores are controlled. In addition to the concrete and abstract practice trial with words not included in the main experiment, conditions featured in the paired-associate learning studies to ensure that they understood the task and that their com- mentioned in this section, the present experiment also includ- puters and Internet connections were working properly. ed a midscale condition to provide a second test of the hypoth- Once the list of pairs was finished, participants could type esis that high-standard-deviation midscale words are harder to out any and all words that they remembered from the list. remember than words from the concrete end of the concrete- Once they were finished, they pressed a BSubmit^ button that ness scale. ended the experiment. There were three experimental condi- tions: A word pair could consist of concrete, abstract, or midscale Bdisagreement^ items. The experiment lasted ap- proximately 15 min. Method Participants Sixty native speakers of English with no report- ed neurological disorders were recruited from the Prolific Academic website. All participants were paid £6 for their time. Materials Figure 8 depicts the means and standard deviations of the concreteness ratings for the concrete and abstract stim- uli in Experiment 2. Table 6 displays the psycholinguistic characteristics of the stimuli featured in the experiment, by condition. 
In Experiment 2 the additional control variable of mean bigram frequency was introduced, because participants would be reading and writing words as opposed to hearing and Fig. 8 Concrete and abstract stimuli featured in Experiment 2 speaking them. There were eight pairs of words in each Behav Res (2018) 50:1198–1216 1207 Table 6 Summary of stimulus characteristics for Experiment 2 Condition Mean Concreteness SD Concreteness AOA Zipf Frequency L Phon N Syll Length BG Mean Concrete 4.51 (0.23) 0.91 (0.13) 9.92 (1.9) 3.54 (0.56) 4.75 (0.2) 1.75 (0.43) 6.125 (1.41) 3,573 (1,151) Abstract 1.61 (0.17) 0.81 (0.11) 10.04 (1.64) 3.48 (0.69) 5.25 (1.44) 1.75 (0.43) 6.44 (1.5) 3,457 (1,176) Disagreement 3 (0.23) 1.33 (0.02) 9.78 (1.95) 3.72 (0.78) 5.75 (1.48) 1.81 (0.39) 6.38 (1.45) 3,218 (957) Mean concreteness: Mean concreteness rating; SD concreteness: The mean standard deviation of the concreteness ratings; AoA: Age of acquisition; Zipf frequency: Word frequency in Zipf units; L Phon: Length of word in phonemes; N Syll: Number of syllables; Length: Length of word in letters; BG mean: Mean bigram frequency Results Interim summary Table 7 displays the mean numbers of words remembered Experiments 1 and 2 did not produce a concreteness effect. across conditions in Experiment 2. This is worrying, given the concerns about the typically high The numbers of words recalled out of 16 were low, but the standard deviations of abstract stimuli outlined above. If we variability across participants was large, as indicated by the increased a difference between conditions on some linear relatively high standard deviations of the mean numbers of measure, we would not expect experimental effects based on words recalled. This suggests floor effects for some partici- this measure to disappear. However, Kousta, Vinson, and pants. 
Second, the mean number of words in the abstract con- Vigliocco (2009) showed that words with a high emotional dition was numerically larger than that in the concrete condi- valence (whether positive or negative) enjoy a processing ad- tion (3 < 3.43), so already we have failed to find evidence in vantage over words with neutral emotional valance. Abstract favor of a concrete stimulus advantage in paired-associate words tend to be rated higher for emotional valance than con- learning. Finally, the difference between the means of the crete words, and this variable was not controlled in concrete and disagree conditions was miniscule (3 vs. 3.05, Experiment 1 or 2. Thus, it could be that a confound in the respectively). stimuli used in Experiments 1 and 2 obscured any concrete- The data were analyzed using a generalized linear mixed ness effect. Warriner et al.’s(2013) emotional valance norms model fit by maximum likelihood (Laplace approximation) for ~14,000 English words would allow us to check this pos- using the glmer function from the lme4 package in R. The sibility. Emotional valance is rated on a scale of 1 (highly dependent variable in this analysis was therefore the likeli- negative)to9(highly positive), with a score of 5 indicating hood of a participant recalling any word. Subjects and items an emotionally neutral word. Given that either emotional pos- were included as random effects with varying intercepts, and itivity or negativity results in a processing advantage, the ab- the fixed effect of condition was the effect of interest. Both solute value of 5 minus the emotional valance of a word pro- abstract and disagree conditions were compared to the con- vides a simple linear measure of emotional valance that ig- crete condition. The results of this analysis are presented in nores polarity (0 = totally neutral,4= highly emotionally Table 8. valenced). 
Table 9 presents the mean absolute emotional va- Experiment 2 generated no statistically significant effects: lences of the stimuli featured in Experiments 1 and 2. p = .2 for the concrete–abstract contrast, and p =.88 for the The words in the concrete and midscale conditions were concrete–disagree contrast. This pattern of results is the same indeed less emotionally valenced than those in the abstract as that found in Experiment 1: Under conditions that should conditions in both experiments, so this might explain the null have made a concreteness effect stronger, such an effect was results obtained from Experiments 1 and 2. not obtained. However, ultimately we should be cautious in Another potential issue is that the words featured in drawing any conclusions from the results of Experiment 2, Experiments 1 and 2 were of relatively low frequency (be- because floor effects may have obscured any differences be- tween 3 and 4 on the Zipf scale), so it could be that participants tween conditions. did not know all of the words. This could have obscured any effect of manipulating concreteness. Brysbaert et al. (2013) provided a measure of how many of their participants reported A reviewer noted that analyzing the data in this way meant that this exper- that they knew a word. Table 10 below displays the mean iment was arguably no longer a paired-associate learning task, presumably percentages of participants who reported knowing a word because it did not account for the paired relationship between the words. In for each condition in Experiments 1 and 2. their free-recall analyses, Paivio et al. (1994) calculated the proportions of words remembered and conduct by-subjects and by-items ANOVAs on these proportions. These analyses also ignored word pair relationships and produced concreteness effects, so I think we would still expect the analysis presented My thanks to an anonymous reviewer for bringing this to my attention. here to produce a concreteness effect. 
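The decision rules applied to Experiment 1 and the polarity-free valence measure described in the interim summary are simple enough to check by hand. The following is a minimal Python sketch of that arithmetic, not the author's analysis (which was run in R): the function names are hypothetical, the p values and Bayes factors are those reported for Experiment 1, and the three valence ratings are made-up inputs at the poles and midpoint of the 1-to-9 scale.

```python
# Illustrative helpers (hypothetical names); inputs are values reported in the text.


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test alpha after a Bonferroni correction for n_tests comparisons."""
    return alpha / n_tests


def classify_bayes_factor(bf: float, lower: float = 1 / 3, upper: float = 3.0) -> str:
    """Label a Bayes factor (alternative over null) using the 1/3 and 3 cutoffs."""
    if bf < lower:
        return "favors null"
    if bf > upper:
        return "favors alternative"
    return "inconclusive"


def absolute_valence(valence: float, neutral: float = 5.0) -> float:
    """Polarity-free valence: distance of a 1-9 rating from the neutral midpoint."""
    return abs(neutral - valence)


# Experiment 1 contrasts against the concrete baseline (p values from Table 5).
p_values = {"abstract": 0.13, "agree": 0.08, "disagree": 0.02}
alpha = bonferroni_alpha(0.05, len(p_values))
significant = {name: p < alpha for name, p in p_values.items()}

# Experiment 1 pairwise Bayes factors as reported in the text.
bayes_factors = {"abstract": 0.32, "agree": 0.38, "disagree": 0.66}
bf_labels = {name: classify_bayes_factor(bf) for name, bf in bayes_factors.items()}

# Made-up ratings at the two poles and the midpoint of the valence scale.
example_valences = [absolute_valence(v) for v in (1.0, 5.0, 9.0)]

if __name__ == "__main__":
    print(f"corrected alpha = {alpha:.3f}")
    print(significant)
    print(bf_labels)
    print(example_valences)
```

With the reported values, none of the three p values falls below the corrected threshold, and only the concrete–abstract Bayes factor (0.32) crosses the 1/3 cutoff, which is consistent with the interpretation given for Experiment 1.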
Again, my thanks to an anonymous reviewer for pointing this out.

Table 7 Mean words recalled by condition in Experiment 2

| Condition | Mean Words Recalled (SD) | Mean Percentage Recalled |
|---|---|---|
| Concrete | 3 (2.73) | 18.6% |
| Abstract | 3.43 (3.07) | 21.5% |
| Disagree | 3.05 (2.84) | 19.1% |

Table 8 Summary of a generalized linear mixed model analysis of Experiment 2

| Effect | Effect Estimate | Std. Error | z | p |
|---|---|---|---|---|
| Abstract | .19 | .15 | 1.3 | .2 |
| Disagree | .02 | .15 | .15 | .88 |

Table 9 Emotional valences of stimuli featured in Experiments 1 and 2

| Experiment | Concrete | Abstract | Disagree | Agree |
|---|---|---|---|---|
| 1 | 0.82 | 1.17 | 0.88 | 1.15 |
| 2 | 0.91 | 1.61 | 0.99 | N/A |

Table 10 Mean percentages of participants who reported in Brysbaert et al. (2013) knowing the words featured in Experiments 1 and 2

| Experiment | Concrete | Abstract | Disagree | Agree |
|---|---|---|---|---|
| 1 | 98.5% | 98.3% | 97.7% | 98.5% |
| 2 | 99.5% | 99.1% | 98% | N/A |

These percentages are high, so it is likely that the number of participants in Experiments 1 and 2 who did not know a word was very low. However, it would obviously be preferable if only words with known percentages of 100% were used. Unfortunately, for reasons detailed in the General Discussion below, enforcing this control raised new problems. I now report an additional list memory experiment that controlled for emotional valence, in order to provide a better test of the robustness of the concreteness effect.

Experiment 3

Experiment 3 was a free-recall list memory experiment in the vein of Experiment 1. There were three changes to the paradigm. First, six-word lists were used instead of eight-word lists. This change was made so that more trials per condition (15 in Exp. 3 vs. 10 in Exp. 1) could be fitted into roughly the same amount of time. Romani et al. (2008) and Miller and Roodenrys (2009) both reported concreteness effects with six-word lists. Second, the words were presented visually, and participants wrote out the words at the end of a list instead of speaking them out loud. This change was made because, to maximize efficiency, the experiment was run over the Internet using the Gorilla.sc platform. Finally, only three conditions were included: concrete, abstract, and midscale words with high standard deviations.

Method

Participants

A total of 70 participants were recruited from the Prolific Academic website. Of these, 62 completed the experiment. The other eight did not respond to every trial, and so were excluded. The experiment was delivered via Gorilla.sc and lasted approximately 35 min. Participants were paid £5 for their time.

Materials

The stimuli were controlled for the following psycholinguistic variables: standard deviation of the concreteness rating, frequency, age of acquisition, number of syllables, number of letters, mean bigram frequency, and emotional valence. Table 11 contains the mean values (with standard deviations in parentheses) of each of these variables for each condition, as well as the mean percentages of people in the Brysbaert et al. (2013) norms who reported knowing the words in each condition. There were three experimental conditions: concrete, abstract, and midscale. There were 15 six-word lists in each condition.

Table 11 Summary of stimulus characteristics for Experiment 3

| Condition | Mean Concreteness | SD Concreteness | AoA | Zipf Frequency | N Syll | Length | BG mean | Absolute Valence | Percent Known |
|---|---|---|---|---|---|---|---|---|---|
| Concrete | 4.55 (0.17) | 0.81 (0.12) | 10.11 (1.28) | 3.41 (0.48) | 2.42 (0.86) | 7.63 (1.79) | 3,649 (1,134) | 1.12 (0.77) | 99% |
| Abstract | 1.61 (0.15) | 0.85 (0.11) | 10.2 (1.95) | 3.54 (0.72) | 2.53 (0.89) | 7.63 (1.95) | 3,710 (1,208) | 1.15 (0.78) | 99% |
| Midscale | 3.02 (0.26) | 1.51 (0.77) | 10.11 (1.99) | 3.53 (0.72) | 2.54 (0.86) | 7.57 (1.89) | 3,737 (1,184) | 1.15 (0.77) | 98.7% |

Mean concreteness: Mean concreteness rating; SD concreteness: The mean standard deviation of the concreteness ratings; AoA: Age of acquisition; Zipf frequency: Word frequency in Zipf units; N Syll: Number of syllables; Length: Length of word in letters; BG mean: Mean bigram frequency; Absolute Valence: Absolute value of 5 minus the Warriner et al. (2013) emotional valence score.

Procedure

Participants were presented with words in sequence one at a time in the center of their computer screens. As in Romani et al.’s (2008) visual paradigms, each word remained on the screen for 3 s. After each list had been presented, participants typed out any and all words that they could remember. They were told that the order of the words did not matter and not to worry about spelling. Participants received two practice trials in order to ensure that they understood how to complete the experiment. The orders of the lists and of the words within each list were randomized for each participant.

Results

Table 12 summarizes the mean numbers of words remembered (and standard deviations) by condition.

Table 12 Mean words recalled by condition for Experiment 3

| Condition | Mean Words Recalled (SD) | Mean Percentage Recalled |
|---|---|---|
| Concrete | 4.06 (1.31) | 67.7% |
| Abstract | 3.7 (1.25) | 61.7% |
| Midscale | 3.85 (1.28) | 64.2% |

The results from Experiment 3 were analyzed in the same way as the results from Experiment 1. Both frequentist and Bayesian analyses are presented. Table 13 displays the results of a mixed-effects linear model with a fixed effect of condition and random intercepts for subjects and items.

Table 13 Summary of frequentist mixed-effects model for Experiment 3 (fixed effects)

| Effect | Estimate | Error | df | t | p | Lower 95% CI for Effect | Higher 95% CI for Effect |
|---|---|---|---|---|---|---|---|
| Abstract | –.37 | .12 | 44.34 | –3.11 | .003 | –.61 | –.13 |
| Midscale | –.21 | .12 | 44.34 | –1.79 | .08 | –.45 | .03 |

After controlling for the effects of emotional valence, these results are much more encouraging for the status of concreteness as a useful psycholinguistic variable. The concrete–abstract comparison is statistically significant at p = .003, and the difference is in the direction we would expect. The contrast between the concrete and midscale conditions was not statistically significant (p = .08). Because this experiment still featured a relatively small number of items, a Bayesian model comparison analysis was deployed in an attempt to offset a potential lack of power. Again, the default settings and priors of the BayesFactor package were used. As in Experiment 1, the results from Experiment 3 were split into subsets so that the abstract and midscale conditions would be compared to the concrete condition individually. The resulting Bayes factors for each comparison were concrete versus abstract, 5.85; concrete versus midscale, 0.47. For the concrete–abstract comparison, the Bayesian analysis is comparable to the frequentist analysis: The data are 5.85 times more likely under a model containing an effect of condition than under a model without this effect, which is quite strong evidence in favor of a concreteness effect. However, the concrete–midscale analysis was inconclusive. One thing to note is that Experiment 3 featured words with similar rates of knowledge to those in Experiments 1 and 2. Experiment 3 produced a concreteness effect, so this might partially allay concerns that Experiments 1 and 2 produced null results because participants did not know the words used. I now turn to a general discussion of these results in light of the issues discussed in the introductory section on concreteness norms, as well as a consideration of other psycholinguistic variables (imageability, modality exclusivity norms, and emotional valence).

General discussion

The first two experiments did not produce a concreteness effect, but these experiments featured a confound: The abstract stimuli had higher emotion ratings than the concrete stimuli. Experiment 3 controlled for emotional valence, and the typical concreteness effect reemerged. This highlights the importance of controlling for emotional valence in list memory paradigms. There were no statistically significant differences between the concrete conditions and the midscale conditions in any experiment. This demonstrates that researchers who are interested in the concreteness effect should maximize the contrast between concrete and abstract stimuli and keep the standard deviations of their stimuli low (below 1) in order to maximize their chances of detecting an effect.

It might seem curious that, given that other list memory studies have revealed concreteness effects when comparing mostly concrete stimuli with mostly high-standard-deviation midscale stimuli, no such effect was obtained in any of the experiments reported here. As I argued when discussing the Brysbaert et al. (2013) norms, the middle of the concreteness scale is marked by a high degree of variability that is difficult to interpret. One of the aims of this article was to test the possibility that words that people agree about how to rate are easier to remember than words that people disagree about how to rate. The three experiments reported here do not provide evidence either way on this point: p values above .05 (corrected) and Bayes factors between 1/3 and 3 for the concrete–midscale comparisons indicate evidence for neither the null nor the alternative hypothesis. The most likely reason for this is a lack of power: The experiments presented here did not feature many stimuli per condition. However, as I will discuss below, this problem is harder to address than might first appear. Furthermore, the abstract conditions in the experiments of Romani et al. (2008), Walker and Hulme (1999), Miller and Roodenrys (2009), and Allen and Hulme (2006) were not entirely made up of midscale stimuli. So if there is a concreteness effect in list memory experiments, the abstract–concrete comparisons in these previous experiments would be more likely to detect it than were the concrete–midscale comparisons reported here.

This issue aside, in light of my arguments regarding Brysbaert et al. (2013), we should probably avoid using midscale words on purely theoretical grounds: It is unclear what an individual concreteness rating is even measuring when it has a high standard deviation. A reviewer raised the point that abstract words tend to have more variable meanings than concrete words, so more variability in their ratings might be expected. This may be true, but I think it somewhat misses the point. If there is any point in using the concreteness measure (or the other measures I discuss below), we have to take our participants’ ratings seriously. If a word in the middle of the scale has a standard deviation above 1, that means a significant number of participants judged it to be concrete. Thus, there isn’t a basis for putting that word in the “abstract” category: It does not make sense to pay attention to only half of the participants’ judgments. There is another potential issue, even if we make sure to restrict our “abstract” stimuli to mean ratings of 2 or below. Typically, concreteness research has focused on nouns rather than adjectives or verbs. Even starting with a set of 40,000 words, the number of nouns in the Brysbaert et al. (2013) norms that (1) have a mean rating of 2 or below (i.e., are highly abstract), (2) have a standard deviation of 1 or below, and (3) were known by 100% of the norming population is only 275. Of these, a small but nontrivial number are either idiomatic fragments (“amuck”) or morphologically complex rarities (“purposefulness”) that we might be reluctant to include in stimulus lists. In contrast, 2,888 well-known nouns have mean ratings of 4 or above and standard deviations below 1. I think this fact should also motivate caution concerning the utility of the concreteness measure. Ultimately, the measure is supposed to tap into a fundamental, neuropsychologically real distinction between different kinds of concepts. It is worrying that the rating is only interpretable for a small number of nominal “concepts” at the abstract pole. However, it is still the case that Experiment 3 produced a concreteness effect. At the very least, we can say that there is some evidence that samples of these “truly” abstract words tend to be harder to remember than highly concrete words.

I turn now to a discussion of other semantic psycholinguistic variables. The midscale variability problem applies to other variables that measure sensorimotor experience. This is not surprising, because these variables are derived in much the same way as concreteness (by taking the mean value of a set of individual judgments about the depth of sensorimotor experience). This is especially significant because it shows that nothing about the Brysbaert et al. (2013) concreteness norms is deficient. Instead, the problems I have identified here are general to a whole class of psycholinguistic measures.

Figure 9 presents a mean–standard deviation plot of the imageability ratings of 6,000 words, amalgamated from two databases (Cortese & Fugett, 2004; Schock et al., 2012). Imageability is a measure of how easy it is to generate a mental image of the referent of a word, and this variable is so highly correlated with concreteness that the two have often been used interchangeably in the literature.

Fig. 9 Means and standard deviations of imageability ratings for 6,000 words (Cortese & Fugett, 2004; Schock et al., 2012). x-axis: Mean imageability; y-axis: Standard deviation

The distribution is identical to that of the concreteness measure. A similar pattern emerges for Lynott and Connell’s (2012) modality exclusivity norm (MEN). MEN essentially measures the same thing as concreteness, but it provides more information because it features ratings for all five primary sensory modalities (sight, sound, touch, taste, and smell). A low rating indicates that the referent of a word offers little experience in a given modality; a high rating indicates that a referent offers a lot of experience. Each word is rated on all five modalities. This results in a five-element vector from which various measures can be derived (mean sensory experience, maximum sensory experience, Euclidean distance from origin, etc.). Figure 10 displays mean–standard deviation plots of all 400 words in the MEN for the five sensory modalities.

Fig. 10 Means and standard deviations of Lynott and Connell’s (2012) modality exclusivity norms. Panels: Haptic, Visual, Auditory, Olfactory, Gustatory; x-axis: Mean rating; y-axis: Standard deviation

What is striking here is that even with just 400 words, the familiar shape of the distribution is clearly apparent. I do not think that we can ignore the fact that all of these datasets have the same problematic distribution. This is likely to be a result of the question that we ask participants when we generate these measures. When we present depth of sensorimotor experience as a scale, we are implicitly committing to the idea that it is possible for an entity to be “half-real,” or “half in space–time,” or “half-seeable.” The distributions of these semantic variables tell us that participants tend to reject this idea: They do not use midscale values.

One solution might be to specify explicitly what we want the middle of these scales to represent, and to provide examples of midscale words for participants so that they have something to anchor their judgments to. Whether something along these lines would usefully decrease variability in the middle of the scale is an open question, but a potential issue here is that it is very difficult (for me) to think of a construct that could serve as a midscale anchor between “concreteness” and “abstractness.” More worryingly, given that there are relatively few words in the abstract half of the scale with low standard deviations, it could be that the concrete–abstract dichotomy is just not well formed.

Finally, I want to briefly discuss the distribution of emotional valence ratings. Emotional valence is different from the sensorimotor variables discussed above, in that it measures a completely separate dimension of experience. The standard deviation of an emotional valence rating also takes on a special importance because of how the scale is constructed. Figure 11 presents the means and standard deviations of the emotional valence scores from Warriner et al. (2013) (n = 13,900). Warriner et al. presented this plot and touched on this issue, but they did not raise exactly the same point as the one I want to focus on here.

Fig. 11 Means and standard deviations of Warriner et al.’s (2013) emotional valence norms. Panel: All words; x-axis: Mean emotional valence; y-axis: Standard deviation

Recall that a score of 1 indicates extremely negative emotional valence, 5 indicates neutrality, and 9 indicates extremely positive emotional valence. Looking at Fig. 11, it should be possible to select unequivocally negative, neutral, and positive words for use in experiments: There are some words at mean ratings of 1, 5, and 9 with low standard deviations. This is obviously a good thing.

However, because the middle of this scale is a neutral point between two extremes, words with high standard deviations are especially problematic. This is because a 5 is supposed to indicate emotional neutrality. But if a word has a mean of 4–6 but a standard deviation of 2 or more, that means that on average, participants actually associate moderate to large emotional responses with that word. Some participants associate positive emotions with the word, but others associate negative emotions with it. Quite a few words look neutral, but in fact are not. A few examples are:

Cell: Mean = 4.09, SD = 2.69
Sushi: Mean = 6.25, SD = 2.77
Gym: Mean = 5.84, SD = 2.52

Similarly, if a word has a mean emotional valence of, say, 3, but a standard deviation above 1.5, that means that some people report a very strong negative response to that word, whereas some people report little or no emotional response at all. So if a researcher is interested in comparing responses to neutral words with responses to emotionally valenced words, they should definitely avoid words with high standard deviations for emotional valence, because they will add a significant amount of noise to the experimental design. One positive thing to note is that for the emotional valence measure, a high standard deviation is potentially problematic but is still interpretable. It makes sense that different people will associate different emotions with certain words. It also makes sense to think of our emotional responses as graded. I think this is a key difference between the sensorimotor experience variables and the emotional valence measure.

Conclusion

I have argued that there is a problem with the statistical characteristics of various semantic psycholinguistic variables (focusing in particular on the concreteness variable). In a great number of cases, mean values do not reflect the judgments that actual participants made about a word. Furthermore, mean values in the middle of these scales are difficult to interpret because it is not clear what property they indicate. Unfortunately, it appears that in many experiments reported throughout the literature on concreteness effects, many of the stimuli in the abstract conditions are not actually abstract. Instead, they are precisely those stimuli for which the mean concreteness value is a bad indicator of what participants’ choices were. In two of the new list memory experiments reported here, no concreteness effect was obtained when the contrast in concreteness between conditions was maximized. However, when emotional valence was controlled, a concreteness effect was obtained in Experiment 3.

The concreteness effect obtained in Experiment 3 is encouraging, because it allays some of the concerns outlined above. However, there are still a number of reasons to be cautious about concreteness and other related semantic variables. First, the status of words with high standard deviations is entirely unclear. These high standard deviations for midscale words might arise at least partially because it is unintuitive to treat sensorimotor experience as a graded property. Second, only a very small number of “abstract” nouns have low standard deviations. This calls into question the utility of the concreteness–abstractness dichotomy as it is currently operationalized. Also, researchers who want to use nominal stimuli or control for word class have very little choice if they want to keep the standard deviations of their stimuli low. For the emotional valence measure, I think the picture is somewhat better. High standard deviations provide meaningful information, although it is perhaps even more important to keep standard deviations low when making comparisons between different areas of the scale.

The good news is that the use of new large-scale psycholinguistic databases such as the Brysbaert et al. (2013) concreteness norms and Warriner et al.’s (2013) emotional valence norms rather than relatively small, older databases (Coltheart, 1981) can allow researchers to sidestep the problems I raise completely. This is because the sheer size of these datasets allows for the selection of suitable stimuli.

Author note I thank Robyn Carston and Sebastian Crutch for their invaluable support in the preparation of the manuscript, and Matthew Jones for advice about R packages. I also thank three reviewers for advice and constructive criticism. All mistakes are entirely my own. This work was supported by the Economic and Social Research Council (grant number ES/J500185/1).

Appendices

Table 14 1. Experiment 1 stimuli

| List | Condition | Word 1 | Word 2 | Word 3 | Word 4 | Word 5 | Word 6 | Word 7 | Word 8 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | disagree | polling | dipstick | decade | centaur | exhaust | foreword | limbo | spender |
| 2 | disagree | physic | sequel | deacon | nettle | output | earshot | deadline | cackle |
| 3 | disagree | brethren | zenith | deluge | silence | lawsuit | theorist | polka | margin |
| 4 | disagree | nappy | degree | panic | bearings | legend | request | physics | prefect |
| 5 | disagree | sponsor | delta | dropper | phantom | egghead | rightness | aerial | eyesight |
| 6 | disagree | halter | brainwave | mankind | nightlife | surname | scrounger | tunic | omen |
| 7 | disagree | pariah | divorce | cosmos | sundries | purveyor | demon | crosswind | alias |
| 8 | disagree | grammar | conveyance | easement | blackball | woodland | giantess | weeknight | instant |
| 9 | disagree | tidbit | shallows | photon | plural | hallmark | grafting | sandman | nature |
| 10 | disagree | slipstream | audit | poorhouse | minute | rival | tribune | abyss | spectrum |
| 11 | agree | menace | bookie | tinting | flicker | rebound | squatter | tempo | pusher |
| 12 | agree | uprise | digest | tiling | region | charmer | joyride | outbreak | nutrient |
| 13 | agree | hubbub | matron | median | nuthouse | pullout | partner | distaste | refill |
| 14 | agree | burial | backwash | mover | career | event | footing | caper | peacetime |
| 15 | agree | jailbreak | torment | hazard | instinct | guru | downpour | richness | glucose |
| 16 | agree | bunting | rhythm | stalker | dullness | ascent | headache | gunpoint | welfare |
| 17 | agree | ringside | archduke | turmoil | shyness | posse | gangway | shipping | outreach |
| 18 | agree | sunburst | mishap | bumpkin | deceit | villain | bloodlust | misdeed | hunting |
| 19 | agree | diesel | roughhouse | attempt | whiner | viewpoint | freshness | stampede | leader |
| 20 | agree | semblance | havoc | broadside | dining | image | dissent | goner | culprit |
| 21 | abstract | setback | vagueness | spirit | notion | loyalty | esteem | phrasing | credence |
| 22 | abstract | charade | rapture | betrayal | logic | backlash | renown | letdown | affront |
| 23 | abstract | desire | mystique | intent | vantage | glory | nuance | unease | motive |
| 24 | abstract | amends | prestige | godsend | satire | leeway | wordplay | pretense | calmness |
| 25 | abstract | accord | whimsy | disdain | hardship | virtue | manner | regard | effect |
| 26 | abstract | freelance | mischief | respite | folly | pureness | repute | courage | meantime |
| 27 | abstract | merit | standpoint | future | allure | rapport | wisdom | prudence | insight |
| 28 | abstract | mistake | quantum | dogma | function | purpose | willpower | hearsay | meaning |
| 29 | abstract | patience | aspect | debut | fairness | pity | taboo | riddance | appeal |
| 30 | abstract | piety | finesse | foresight | longshot | loathing | stigma | concern | control |
| 31 | concrete | leaflet | roadhouse | artist | lighting | parsley | seabed | ironwork | lacrosse |
| 32 | concrete | clipper | pewter | cauldron | quarry | blockade | earwig | clubfoot | logbook |
| 33 | concrete | summit | breeches | abscess | foreman | award | entree | funnel | beacon |
| 34 | concrete | corset | template | pigment | fuchsia | urchin | ringworm | crewman | mansion |
| 35 | concrete | jester | gasket | sternum | backdrop | bouncer | chapel | resort | county |
| 36 | concrete | penthouse | fracture | entrails | vinyl | buckskin | tundra | barrier | plumbing |
| 37 | concrete | timepiece | methane | record | tiller | grindstone | merchant | shrapnel | duchess |
| 38 | concrete | quarter | bulkhead | sarong | tenant | chamber | canon | bailiff | machine |
| 39 | concrete | beaker | clinic | tango | clothing | amber | jackal | roulette | survey |
| 40 | concrete | spiral | marrow | billiard | bootlace | scabies | saffron | captain | product |

Table 15 2.
Experiment 2 stimuli Pair Condition Word 1 Word 2 1 concrete cauldron hike 2 concrete footman band 3 concrete blazer creature 4 concrete rubble liqueur 5 concrete throttle ulcer 6 concrete ranch gauntlet 7 concrete cadet concert 8 concrete ledge manor 9 abstract betrayal urge 10 abstract revenge foresight 11 abstract godsend risk 12 abstract wisdom psyche 13 abstract hardship malice 14 abstract greed riddance 15 abstract loyalty lenience 16 abstract bliss mercy 17 midscale genius royalty 18 midscale foreground district 19 midscale gleam patriot 20 midscale view approach 21 midscale upstart brawn 22 midscale expanse profit 23 midscale asset vortex 24 midscale habit encore Table 16 3. Experiment 3 stimuli List Number Condition Word 1 Word 2 Word 3 Word 4 Word 5 Word 6 1 concrete pad harpoon stretcher kennel ulcer aftershave 2 concrete trachea parsley fuselage rifleman plaster medallion 3 concrete cedar rubble trinket composer liver dormitory 4 concrete scale shipment gladiator guesthouse morgue marrow 5 concrete vineyard porcelain cocktail warship advisor slate 6 concrete supervisor infirmary bouquet manicure bay tomb 7 concrete graphics sage smoothie wildfire prosecutor sapphire 8 concrete inspector minefield tourist stub horseradish frostbite 9 concrete guitarist notch gauntlet orphanage vegetation bomber 10 concrete greenhouse sedative museum silicon wreckage accountant 11 concrete incubator lavender surgeon violinist courtroom embroidery 12 concrete landlord measles dictator pacemaker minibus plumber 13 concrete newsletter bodyguard stockbroker foliage petroleum liqueur 14 concrete plantation attorney blockade antibiotic concert currency 15 concrete stroke titanium bile sniper massage adhesive 16 abstract urge renown patience motive malice quandary 17 abstract penance belief indulgence reproach version fixation 18 abstract mercy glory charade aptitude manner formality 19 abstract risk psyche rhetoric foresight fraud regard 20 abstract prudence oblivion hardship 
mood sarcasm fate 21 abstract extent imposition purpose competence luck whim 22 abstract willpower bias indecision loyalty seriousness knowledge 23 abstract involvement existence coincidence ruse principles betrayal 24 abstract detriment subtlety tradition damnation wisdom fantasy 25 abstract forgiveness semantics value sanctity godsend discretion 26 abstract eternity politeness concept reasoning anomaly symbolism 27 abstract suspicion goodness arrogance mortality chance theory 28 abstract precedent privacy likelihood lunacy oversight revenge 29 abstract affirmative repentance leniency similarity merit expertise 30 abstract wickedness analogy bliss coercion courage avoidance 31 midscale plot molecule mankind format swindle motherland 32 midscale hormone reply tarot tribune routine pushover 33 midscale delay gossip slumber bandwagon response vigilante Behav Res (2018) 50:1198–1216 1215 Table 16 (continued) List Number Condition Word 1 Word 2 Word 3 Word 4 Word 5 Word 6 34 midscale zone shallows pinnacle wavelength grief degree 35 midscale envoy character fallout clue vacancy tone 36 midscale circulation drunkenness midsummer doctorate goal hoax 37 midscale cutthroat rift corporation lawsuit translation sweetness 38 midscale announcement activist process slack formation whiplash 39 midscale chronicle monologue overlap motherhood virus penalty 40 midscale exhaustion delegate magic rebuttal crackpot diversion 41 midscale entirety ugliness factor ancestry confidant purgatory 42 midscale engagement accident insomnia regulator utility egghead 43 midscale repellent takeover provision dioxide offence thinker 44 midscale equivalent oracle ignition visibility ransom narrative 45 midscale sense extremity content lunatic divorce casualty Open Access This article is distributed under the terms of the Creative Crutch, S. J., & Warrington, E. K. (2005). 
Abstract and concrete concepts Commons Attribution 4.0 International License (http:// have structurally different representational frameworks. Brain, 128, creativecommons.org/licenses/by/4.0/), which permits unrestricted use, 615–627. doi:10.1093/brain/awh349 distribution, and reproduction in any medium, provided you give de Groot, A. M. (1989). Representational aspects of word imageability appropriate credit to the original author(s) and the source, provide a link and word frequency as assessed through word association. Journal to the Creative Commons license, and indicate if changes were made. of Experimental Psychology: Learning, Memory, and Cognition, 15, 824–845. doi:10.1037/0278-73126.96.36.1994 Dhond, R. P., Witzel, T., Dale, A. M., & Halgren, E. (2007). Spatiotemporal cortical dynamics underlying abstract and concrete References word reading. Human Brain Mapping, 28, 355–362. doi:10.1002/ hbm.20282 Gee, N. R., Nelson, D. L., & Krawczyk, D. (1999). Is the concreteness Allen, R., & Hulme, C. (2006). Speech and language processing mecha- effect a result of underlying network interconnectivity? Journal of nisms in verbal serial recall. Journal of Memory and Language, 55, Memory and Language, 40, 479–497. doi:10.1006/jmla.1998.2627 64–88. doi:10.1016/j.jml.2006.02.002 Huang, H.-W., Lee, C.-L., & Federmeier, K. D. (2010). Imagine that! Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., ERPs provide evidence for distinct hemispheric contributions to Loftis, B.,…Treiman, R. (2007). The English Lexicon Project. the processing of concrete and abstract concepts. NeuroImage, 49, Behavior Research Methods, 39, 445–459. doi:10.3758/ 1116–1123. doi:10.1016/j.neuroimage.2009.07.031 BF03193014 Jager, B., & Cleland, A. A. (2016). Polysemy advantage with abstract but Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear not concrete words. Journal of Psycholinguistic Research, 45, 143– mixed-effects models using lme4. 
Journal of Statistical Software, 156. doi:10.1007/s10936-014-9337-z 67, 1–48. doi:10.18637/jss.v067.i01 James, C. T. (1975). The role of semantic information in lexical decisions. Begg, I. (1972). Recall of meaningful phrases. Journal of Verbal Journal of Experimental Psychology: Human Perception and Learning and Verbal Behavior, 11, 431–439. doi:10.1016/S0022- Performance, 1, 130–136. doi:10.1037/0096-15188.8.131.52 537180024-0 Kounios, J., & Holcomb, P. J. (1994). Concreteness effects in semantic Binder, J. R., Westbury, C. F., McKiernan, K. A., Possing, E. T., & processing: ERP evidence supporting dual-coding theory. Journal of Medler, D. A. (2005). Distinct brain systems for processing concrete Experimental Psychology: Learning, Memory, and Cognition, 20, and abstract concepts. Journal of Cognitive Neuroscience, 17, 905– 804–823. doi:10.1037/0278-73184.108.40.2064 Kousta, S.-T., Vigliocco, G., Vinson, D. P., Andrews, M., & Del Campo, Bleasdale, F. A. (1987). Concreteness-dependent associative priming: E. (2011). The representation of abstract words: Why emotion mat- Separate lexical organization for concrete and abstract words. ters. Journal of Experimental Psychology: General, 140, 14–34. Journal of Experimental Psychology: Learning, Memory, and doi:10.1037/a0021446 Cognition, 13, 582–594. doi:10.1037/0278-73220.127.116.112 Kousta, S.-T., Vinson, D. P., & Vigliocco, G. (2009). Emotion words, Brysbaert, M., Stevens, M., Mandera, P., & Keuleers, E. (2016). The regardless of polarity, have a processing advantage over neutral impact of word prevalence on lexical decision times: Evidence from words. Cognition, 112, 473–481. doi:10.1016/j.cognition.2009.06. the Dutch Lexicon Project 2. Journal of Experimental Psychology: Human Perception and Performance, 42, 441–458. doi:10.1037/ Kroll, J., & Merves, J. (1985). Lexical access for concrete and abstract xhp0000159 words. Journal of Experimental Psychology: Learning, Memory, Brysbaert, M., Warriner, A. 
B., & Kuperman, V. (2013). Concreteness and Cognition, 12, 92–107. doi:10.1037/0278-7318.104.22.168 ratings for 40 thousand generally known English word lemmas. Kruschke, J. K. (2011). Bayesian assessment of null values via parameter Behavior Research Methods, 46, 904–911. doi:10.3758/s13428- estimation and model comparison. Perspectives on Psychological 013-0403-5 Science, 6, 299–312. doi:10.1177/1745691611406925 Coltheart, M. (1981). The MRC psycholinguistic database. Quarterly Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age- Journalof ExperimentalPsychology, 33A, 497–505. doi:10.1080/ of-acquisition ratings for 30,000 English words. Behavior Research 14640748108400805 Methods, 44, 978–990. doi:10.3758/s13428-012-0210-4 Kuznetsova, A., Brockhoff, P. B., & Christensen, H. B. (2015). lmerTest: Cortese, M. J., & Fugett, A. (2004). Imageability ratings for 3,000 mono- syllabic words. Behavior Research Methods, Instruments, & Tests in linear mixed effect models (Software). Retrieved from Computers, 36, 384–387. doi:10.3758/BF03195585 http://cran.r-project.org/package=lmerTest 1216 Behav Res (2018) 50:1198–1216 Lee, C., & Federmeier, K. D. (2008). To watch, to see, and to differ: An Romani, C., McAlpine, S., & Martin, R. (2008). Concreteness effects in different tasks: Implications for models of short-term memory. event-related potential study of concreteness effects as a function of word class and lexical ambiguity. Brain and Language, 104, 145– Quarterly Journal of Experimental Psychology, 61, 292–323. doi: 158. doi:10.1016/j.bandl.2007.06.002 10.1080/17470210601147747 Lynott, D., & Connell, L. (2012). Modality exclusivity norms for 400 Sabsevitz, D. S., Medler, D. A., Seidenberg, M., & Binder, J. R. (2005). nouns: The relationship between perceptual experience and surface Modulation of the semantic system by word imageability. word form. Behavior Research Methods, 45, 516–526. doi:10.3758/ NeuroImage, 27, 188–200. 
doi:10.1016/j.neuroimage.2005.04.012 s13428-012-0267-0 Sadoski,M.,Kealy,W.A.,Goetz,E.T.,&Paivio,A.(1997). Marschark, M., & Hunt, R. R. (1989). A reexamination of the role of Concreteness and imagery effects in the written composition of def- imagery in learning and memory. Journal of Experimental initions. Journal of Educational Psychology, 89, 518–526. doi:10. Psychology, 15, 710–720. 1037/0022-0622.214.171.1248 Miller, L. M., & Roodenrys, S. (2009). The interaction of word frequency Schock, J., Cortese, M. J., & Khanna, M. M. (2012). Imageability esti- and concreteness in immediate serial recall. Memory & Cognition, mates for 3,000 disyllabic words. Behavior Research Methods, 44, 37, 850–865. doi:10.3758/MC.37.6.850 374–379. doi:10.3758/s13428-011-0162-0 Morey, R. D., Rouder, J. N., & Jamil, T. (2015). BayesFactor: Skipper-Kallal, L. M., Mirman, D., & Olson, I. R. (2015). Converging Computation of Bayes factors for common designs (version evidence from fMRI and aphasia that the left temporoparietal cortex 0.9.12–2). Retrieved from https://rdrr.io/cran/BayesFactor/ has an essential role in representing abstract semantic knowledge. Nelson, D. L., & Schreiber, T. A. (1992). Word concreteness and word Cortex, 69, 104–120. doi:10.1016/j.cortex.2015.04.021 structure as independent determinants of recall. Journal of Memory ter Doest, L., & Semin, G. (2005). Retrieval contexts and the concreteness and Language, 31, 237–260. doi:10.1016/0749-596X(92)90013-N effect: Dissociations in memory for concrete and abstract words. Paivio, A., Khan, M., & Begg, I. (2000). Concreteness and relational European Journal of Cognitive Psychology, 17, 859–881. doi:10. effects on recall of adjective-noun pairs. Canadian Journal of 1080/09541440540000031 Experimental Psychology, 54, 149–160. van Casteren, M., & Davis, M. H. (2007). Match: A program to assist in Paivio, A., Walsh, M., & Bons, T. (1994). Concreteness effects on mem- matching the conditions of factorial experiments. 
Behavior ory: When and why? Journal of Experimental Psychology: Research Methods, 39, 973–978. doi:10.3758/BF03192992 Learning, Memory, and Cognition, 20, 1196–1204. doi:10.1037/ Walker, I., & Hulme, C. (1999). Concrete words are easier to recall than 0278-73126.96.36.1996 abstract words: Evidence for a semantic contribution to short-term Paivio, A., Yuille, J. C., & Madigan, S. A. (1968). Concreteness, imagery, serial recall. Journal of Experimental Psychology: Learning, and meaningfulness values for 925 nouns. Journal of Experimental Memory, and Cognition, 25, 1256–1271. doi:10.1037/0278-7393. Psychology, 76(1, Pt. 2), 1–25. doi:10.1037/h0025327 25.5.1256 Pexman,P.M., Hargreaves,I. S.,Edwards,J.D., Henry,L.C.,& Warriner, A. B., Kuperman, V., & Brysbaert, M. (2013). Norms of valence, Goodyear, B. G. (2007). Neural Correlates of Concreteness in arousal, and dominance for 13,915 English lemmas. Behavior Semantic Categorization. Journal of Cognitive Neuroscience, 19, Research Methods, 45, 1191–1207. doi:10.3758/s13428-012-0314-x 1407–1419. doi:10.1162/jocn.2007.19.8.1407
Behavior Research Methods – Springer Journals
Published: Jul 13, 2017