The Empirical Validity of the Common European Framework of Reference Scales. An Exemplary Study for the Vocabulary and Fluency Scales in a Language Testing Context

Katrin, Wisniewski,

doi:10.1093/applin/amw057

The Empirical Validity of the Common European Framework of Reference Scales. An Exemplary Study for the Vocabulary and Fluency Scales in a Language Testing Context

Katrin, Wisniewski, 2018-12-01 00:00:00 Abstract In spite of the widespread use of Common European Framework of Reference for language learning, teaching, and assessment (CEFR) scales, there is an overwhelming lack of evidence regarding their power to describe empirical learner language (Fulcher 2004; Hulstijn 2007). This article presents results of a study that focused on the empirical robustness (i.e. the power of level descriptions to capture what learners actually do in a language test) of the CEFR vocabulary & fluency scales (A2-B2). Data stem from an Italian & German oral proficiency test (Abel et al. 2012). Results show that the empirical robustness was flawed: Some scale contents were hardly observable or so evenly distributed that they could not distinguish between learners. Contradictory/weak correlations among scale features and heterogeneous cluster solutions suggest that scales did not consistently capture typical learner behaviour. Often, learner language could not be objectively described by any level description. Also, it was only partially possible to link scale contents to research-based measures of fluency and vocabulary. Given the importance of CEFR levels in many high-stakes contexts, the results suggest the need of a large empirical validation project. INTRODUCTION The Common European Framework of Reference for language learning, teaching, and assessment (CEFR, CoE 2001) has become the most important yardstick for the development of language tests, curricula, educational standards, and textbooks in Europe. There is virtually no (European) high-stakes test not related to the framework. Many crucial decisions about people’s lives are based on CEFR level descriptions. Thus, even if the scale system is only one component of the CEFR, which as a whole takes on a much broader educational and political perspective, it is of fundamental importance to deliver evidence for the possibility to arrive at fair descriptions of learner language with the help of the scales. While there are many open questions related to the CEFR scales (e.g. their relationship to second language acquisition and their theoretical foundations, including the concept of proficiency they are built upon) that are not discussed here, this article focuses on some aspects of empirical validity, roughly understood here as the usefulness of CEFR descriptors to describe authentic learner language. Until now, there is hardly any evidence for the empirical validity of the CEFR scales (Fulcher 2004; Alderson 2007; Hulstijn 2007; Little 2007; Hulstijn et al. 2010, Wisniewski 2013, 2014). As this insecurity is a consequence of the methodology chosen to calibrate the CEFR scales, the article will first briefly sum up the main steps of the scaling process. In that complex procedure, which lay great emphasis on the perceptions of human raters, descriptor contents were not matched onto empirical learner language so that it is unclear if and to what degree CEFR descriptors capture what (different) learners do (in differing contexts, tasks, and target languages). This article aims at tackling the problem of empirical validity of the A2-B2 level descriptions of three selected CEFR scales, that is, vocabulary range and control and spoken fluency, in an exemplary context of an oral proficiency test for German and Italian as L2. Its aim is to examine whether in this context, the level descriptions are empirically robust. Level descriptions are tentatively considered empirically robust in a given context if the concepts they contain are reliably observable in learner texts, and if they allow to group L2 productions coherently and unambiguously into CEFR levels. It is considered a further argument in favour of scale validity if the descriptor contents can be empirically related to research-based measures of the underlying constructs of fluency and vocabulary range and control. In order to establish a direct relationship between learner texts and the CEFR scales and to avoid circular argumentation, all analyses are carried out without taking into consideration human ratings. CEFR SCALES AND EMPIRICAL VALIDITY ISSUES A central point of uncertainty about the empirical validity of the CEFR scales regards their questionable adequateness for the description of learner language. This problem results from the scale calibration methodology employed in the so-called ‘Swiss Project’ (1993–1996; North 2000, 2014b: 14–21; Schneider and North 2000; Council of Europe (CoE) 2001: 217–25). The procedure can be roughly summed up as follows: In a first, intuitive project stage, approximately 2,000 descriptors of L2 competence were pooled. These ‘can-do statements’ were collected from diverse tests of English as an L2, and extracted from different types of scales. The qualitative, second project stage involved 300 Swiss teachers who in 32 workshops sorted the descriptors according to perceived categories of L2 proficiency. The descriptors they best agreed upon were then used used to assess (oral) language productions. In a third, quantitative step, Multi-Facet Rasch Analysis (CoE 2004: Section H) was applied to calibrate descriptors with sufficient statistical qualities (reliability) on one common logit scale. That scale was subdivided into the now well-known six common reference levels of the CEFR (A1-A2-B1-B2-C1-C2, Schneider and North 2000:153); an exemplary scale is presented in Table 1 Table 1: Vocabulary range scale of the CEFR (CoE 2001: 112) C2 Has a good command of a very broad lexical repertoire including idiomatic expressions and colloquialisms; shows awareness of connotative levels of meaning. C1 Has a good command of a broad lexical repertoire allowing gaps to be readily overcome with circumlocutions; little obvious searching for expressions or avoidance strategies. Good command of idiomatic expressions and colloquialisms. B2 Has a good range of vocabulary for matters connected to his/her field and most general topics. Can vary formulations to avoid frequent repetitions, but lexical gaps can still cause hesitation and circumlocution. B1 Has a sufficient vocabulary to express him/herself with some circumlocutions on most topics pertinent to his/her everyday life such as family, hobbies and interests, work, travel, and current events. A2+ Has sufficient vocabulary to conduct routine, everyday transactions involving familiar situations and topics. A2 Has a sufficient vocabulary for the expression of basic communicative needs. Has a sufficient vocabulary for coping with simple survival needs. A1 Has a basic vocabulary repertoire of isolated words and phrases related to particular concrete situations. C2 Has a good command of a very broad lexical repertoire including idiomatic expressions and colloquialisms; shows awareness of connotative levels of meaning. C1 Has a good command of a broad lexical repertoire allowing gaps to be readily overcome with circumlocutions; little obvious searching for expressions or avoidance strategies. Good command of idiomatic expressions and colloquialisms. B2 Has a good range of vocabulary for matters connected to his/her field and most general topics. Can vary formulations to avoid frequent repetitions, but lexical gaps can still cause hesitation and circumlocution. B1 Has a sufficient vocabulary to express him/herself with some circumlocutions on most topics pertinent to his/her everyday life such as family, hobbies and interests, work, travel, and current events. A2+ Has sufficient vocabulary to conduct routine, everyday transactions involving familiar situations and topics. A2 Has a sufficient vocabulary for the expression of basic communicative needs. Has a sufficient vocabulary for coping with simple survival needs. A1 Has a basic vocabulary repertoire of isolated words and phrases related to particular concrete situations. Table 1: Vocabulary range scale of the CEFR (CoE 2001: 112) C2 Has a good command of a very broad lexical repertoire including idiomatic expressions and colloquialisms; shows awareness of connotative levels of meaning. C1 Has a good command of a broad lexical repertoire allowing gaps to be readily overcome with circumlocutions; little obvious searching for expressions or avoidance strategies. Good command of idiomatic expressions and colloquialisms. B2 Has a good range of vocabulary for matters connected to his/her field and most general topics. Can vary formulations to avoid frequent repetitions, but lexical gaps can still cause hesitation and circumlocution. B1 Has a sufficient vocabulary to express him/herself with some circumlocutions on most topics pertinent to his/her everyday life such as family, hobbies and interests, work, travel, and current events. A2+ Has sufficient vocabulary to conduct routine, everyday transactions involving familiar situations and topics. A2 Has a sufficient vocabulary for the expression of basic communicative needs. Has a sufficient vocabulary for coping with simple survival needs. A1 Has a basic vocabulary repertoire of isolated words and phrases related to particular concrete situations. C2 Has a good command of a very broad lexical repertoire including idiomatic expressions and colloquialisms; shows awareness of connotative levels of meaning. C1 Has a good command of a broad lexical repertoire allowing gaps to be readily overcome with circumlocutions; little obvious searching for expressions or avoidance strategies. Good command of idiomatic expressions and colloquialisms. B2 Has a good range of vocabulary for matters connected to his/her field and most general topics. Can vary formulations to avoid frequent repetitions, but lexical gaps can still cause hesitation and circumlocution. B1 Has a sufficient vocabulary to express him/herself with some circumlocutions on most topics pertinent to his/her everyday life such as family, hobbies and interests, work, travel, and current events. A2+ Has sufficient vocabulary to conduct routine, everyday transactions involving familiar situations and topics. A2 Has a sufficient vocabulary for the expression of basic communicative needs. Has a sufficient vocabulary for coping with simple survival needs. A1 Has a basic vocabulary repertoire of isolated words and phrases related to particular concrete situations. In this scaling approach, teacher decisions are of central importance. They are used as data in the statistical procedure (the Rasch model) which serves as an arbiter in determining descriptor quality (Fulcher et al. 2011: 7). The vertical dimension of the CEFR scales thus mirrors ‘scaled teacher perceptions’ (North 2014b: 23). The scales contain only descriptors the difficulty of which teachers could reliably agree upon, so that they are built completely on practitioners’ beliefs. Although it is important for a scale to be plausible for its users, this mono-dimensional scaling methodology potentially threatens validity, and for several reasons. What is most important here is the fact that no learner language was analysed to examine whether the descriptors can be meaningfully applied to authentic data (Tschirner 2005: 55; Hulstijn 2007; Little 2007: 648; Fulcher 2008). Without the link to empirical learner language, though, the usefulness of the descriptors is at stake: ‘If descriptors are to be meaningful characterizations of ability, then they should be able to be related to actual performance.’ (Alderson 1991: 74). A second drawback is the fact that the scaling procedure was not construct-driven, that is, descriptors were not derived from models of language ability or theories of second language acquisition, and no empirical SLA findings were integrated. However, an attempt at relating the horizontal scale categories (e.g. fluency, grammatical accuracy) to taxonomies of proficiency models was made (North 1997; North 2000: 20; 123–9; North 2014b: 22–5). The potential threat to theoretical scale validity is not discussed further here for reasons of space (see, e.g. North 2014b: 22–5; Hulstijn 2007; Wisniewski 2014).1 Third, it is a benefit of the CEFR scales to be usable in a reliable way by raters. However, this reliability is only meaningful if evidence also suggests that raters actually refer to the scale contents in decision-making. Research on the validity of rating behaviour has long revealed many problems, even with trained raters (e.g. Eckes 2008; Wisniewski 2010, 2014; Alderson and Kremmel 2013; Kuiken and Vedder 2014). Hence, when analysing CEFR-based ratings, validity is not to be confused with reliability, the former being still in need of much more research attention. VALIDATION APPROACH Validity is a complex concept. In language assessment, it is not considered a property of a language test, but ‘(…) an integrated evaluative judgment of the degree to which empirical evidences and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment’ (Messick 1989: 13). As a consequence, validity arguments in the field of language testing normally focus on delivering evidence for the appropriateness of interpretations and consequences based on test scores (Messick 1989; Chapelle 1999; Kane 2001; Bachman and Palmer 2010), whereas rating scale validity aspects are rarely analysed in their own right (exceptions are Harsch 2005; Knoch 2007, 2009, 2011). Regardless of that, the CEFR scales have assumed a peculiar role that could be defined as ‘pseudo constructs’, thus contributing considerably to the overall validity of test interpretations (McNamara et al. 2002; Knoch 2007). This underscores the importance of specific CEFR scale validation studies. Surprisingly, to the knowledge of the author there are no validation studies that analyse learner language through the lense of operationalized CEFR scales. Instead, an increasing number of research projects aims at illustrating the meaning of CEFR levels by looking for linguistic correlates of ratings in learner texts (e.g. Bartning et al. 2010; Harrison and Barker 2015; for a discussion, see Wisniewski forthcoming). This methodology must be clearly distinguished from attempts at scale validation like the one presented here. In the ‘criterial feature’ perspective (e.g. Hawkins and Filipovíc 2012), CEFR scales are not questioned for validity. Rather, CEFR-based ratings are used as pre-classifications of learner texts (most often stemming from language tests) on the basis of which linguistic features discriminating between CEFR levels, or rather, between CEFR-based ratings, are searched. This ‘criterial features’ approach is extremely helpful for the validation of ratings, while it does not directly regard the relationship between CEFR scales and learner language, that is, their empirical validity.2 The present study relies on a newly developed approach for empirical CEFR scale validation some elements of which are explained in more detail here. One important premise was the avoidance of human ratings, which in any scale validation effort are problematic. As mentioned above, even professional ratings are known to be biased (Eckes 2008; Wisniewski 2010, 2014; Alderson and Kremmel 2013; Kuiken and Vedder 2014). Even very reliable ratings need not be valid, that is, raters do not necessarily have to refer to the rating instruments at hand in order to arrive at unanimous decisions. Therefore, the validation approach used here works with operationalized contents from CEFR descriptors to analyse learner language directly through the lense of the scales. To understand the focus of this validation study, it is important to remember that CEFR scales are not meant to be used for language testing in their published form. Any testing context requires modifications to adapt the scales to specific test purposes and users (Alderson 1991; Harsch 2005; North 2014a: 244). However, independently of concrete language tests (or any other context), and before putting modified scale versions to use, early on we would want to be sure that what is already there is actually relatable to learner language, that is, empirically robust. The validation approach used here focuses on this very fundamental element of validity. However, even if empirical robustness evidence could be provided, this would still be no guarantee for the development of a concrete valid test as a whole. Empirical robustness evidence can only support the assumption that it is in principle possible to use the CEFR scales in a valid way. Furthermore, validation studies necessarily have to limit themselves to a specific language elicitation context. Learner language is produced in an immense variety of contexts, so that it is crucial to specify what we can expect CEFR scale contents to be relatable to. Theoretically, chapter five scales like the ones considered here claim to be connected to underlying constructs of ‘communicative language competence’ (North 2000: 20, 28; CoE 2001). Therefore, they should be applicable independently of concrete (test) tasks (‘context-free’, see Hudson 2005: 209–10) and relevant to a large number of contexts (Fulcher 1996: 44; North 2007: 658). Here, CEFR descriptors are applied to two oral proficiency test tasks thoroughly related to the CEFR (see below for details). This is considered a typical, widely encountered use context of CEFR-based learner language classification. It is exactly in this type of context that (modified) CEFR scales are regularly used, often with serious consequences for test takers’ lives. However, this is but one of a great number of possible contexts. Furthermore, it is not the aim to shed light on the relationship between language proficiency as a whole, or of second language development, and the CEFR scales, where much more and different types of data would be needed. Apart from the restricted sample size that does not allow for generalizations, another constraint of the study is the focus on empirical robustness alone. In a more comprehensive validation approach, the theoretical foundations of the scales would have to be questioned as well. Another aspect not addressed here, but strongly intertwined with empirical robustness, is the question whether we can rely on the validity of human ratings as reflecting rating scale contents. These aspects are both discussed in Wisniewski (2014). METHODOLOGY To address the empirical validity of the CEFR vocabulary and fluency scales (A2-B2, Italian (ITA) and German (GER) as L2) in a language test, the empirical robustness concept was broken down into the three following main assumptions: If the CEFR scales claim empirical validity, (i) the concepts mentioned in the level descriptions must be observable. These concepts are not spelled out in the CEFR so that they need to be operationalized carefully. Level descriptions are usually based on more than one such concept (e.g. the number and severeness of lexical errors), and for each concept, a prediction is made regarding its quantity (e.g. ‘good/sufficient/basic range of vocabulary’) and/or its more qualitative contextual framing (e.g. ‘vocabulary sufficient for survival needs’). A second prerequisite for empirical validity is the possibility to clearly (ii) group individual learner texts with the help of these predictions. Third, it would strengthen the validity if (iii) links between scale contents and research-based construct measures were shown. In addition, no greater differences between the target languages (ITA/GER) and the tasks should be found. An oral proficiency test with 98 South Tyrolean 17–18-year-old high-school pupils (L1 GER/ITA, L2 ITA/GER) was carried through in a multimethod large-scale language assessment project (Abel et al. 2012). All participants had comparable L2 learning backgrounds. In the test, the participants did a monologue task for which they had to choose one out of four pictures each showing a person. They were asked to describe what they see and to say most they could about the life of the person they had chosen. They had three minutes to prepare themselves. The instructions gave some hints at possible topics. The interviewer was advised to intervene only in case of long silent stretches to suggest further aspects participants could talk about. In the dialogue task, test takers had to choose one of two topics (reality TV shows and horoscopes). They were asked to describe their personal experience and then to critically discuss the topic with the interviewer. Again, they had three minutes to prepare themselves. The total duration was of approximately 15 minutes. All productions were rated independently by two trained raters. Quality standards in the test development, administration, and evaluation process were respected (Bachman and Palmer 1996, 2010; CoE 2004, 2009/2003; Fulcher 2003; see report on full project, Abel et al. 2012). Out of this database, 19 productions each containing two tasks were chosen for this study. Selected candidates produced a sufficient amount of clearly intelligible speech. Additionally, they received highly reliable ratings for vocabulary and fluency. Their rating profiles were flat, that is, they had similar CEFR ratings for the rating criteria applied (see Wisniewski 2014 for details). For the analyses, audio recordings were transcribed in the multi-layer standoff editor ELAN in CHAT. Then, the A2-B2 level descriptions were operationalized (see appendix, Table A1) in order to establish a direct link between CEFR scales and learner language. Unavoidably, this required interpretation. Subjective (e.g. ‘regular interaction with native speakers quite possible’, fluency, B2) and self-referential aspects (e.g. ‘can interact with a degree of fluency…’, fluency, B2) were excluded. In case of translation ambiguity, the English CEFR version was used. At times, several interpretations were plausible; then, all were operationalized and cross-checked. In some cases, ‘inverted’ concepts were used in order to make them more easily comparable across levels. If, for example, the scale claimed that it was typical for a learner not to show breakdowns in communication, the scale variable would count those breakdowns (normalized, i.e. per utterance and word token). Operationalized descriptor contents were termed ‘scale variables’. Many of them are uncommon, subjective, and/or hard to reliably observe, but they reflect the scale contents as directly as possible. One exemplary level operationalization is described in Wisniewski (2013); the complete rationale can be found in Wisniewski (2014). In addition to the scale variables, a considerable number of research-based indicators were used (e.g. lexical density and sophistication measures, mean length of runs, phonation-time-ratio and many others; see appendix, Tables A2 and A3). The annotation was carried through by two coders independently (inter-rater reliability: Pearson’s contingency coefficient C =0.899, Cohen’s Kappa κ =0.773). For all annotations the coders had not agreed upon, a consensus was formed. The corpus was manually segmented into syllables, and WordSmith and the Stuttgart TreeTagger (Schmid 1994) were used for tokenization and lemmatization. The fully annotated corpus is available on a DVD from the author. A variety of statistical procedures was used. Descriptive statistics was employed to evaluate (i) the observability of scale variables. Relative frequencies were calculated with regard to appropriate units of reference, for example, word tokens, AS-units (Foster, Tonkyn, and Wigglesworth 2000), or syllables. To assess (ii) the consistency of level descriptions, correlations among scale variables were calculated (Pearson’s r). Cluster analyses helped group candidates according to operationalized scale contents (Ward and k-means methods). Cluster analysis is a statistical procedure that detects structures in data without pre-classifications (such as human ratings) with regard to pre-defined variables (in this case, scale variables per level and scale). It groups cases (here: learner texts) most similar to each other in terms of scale variables of a level description. In other words, these groups (termed ‘CEFR clusters’) can be understood as an application of the operationalizable parts of the scale contents. The statistical quality of cluster solutions is determined by the (minimized) variance inside the clusters as compared to the variance in the whole population. However, the content plausibility of the cluster solutions plays an equally important role for the present study as will become clear below. Discriminant analyses delivered Wilks’ Lambda (Λ), an inverse measure of variance that estimates the explanatory power of the scale variables in separating clusters from one another. Wilks’ Lambda helps understand how easy it would be to distinguish candidates with the help of the scale variables alone. (iii) To link scale variables to underlying constructs and to search for evidence of the construct-relevance of the clusters, in a first step correlations between the scale variables and research-based measures were calculated. Furthermore, T-tests and discriminant analyses were run to analyse whether the clusters (see ii, above) were relatable to more established measures of the relevant constructs. RESULTS Observability of scale variables Scale variables were generally easier to observe in the dialogue task (T2), maybe due to its construction or simply to the fact that it elicited more language (Table 2). Table 2: Differences in task length Length T1 SD T1 T2 SD T2 Mean length of production, tokens 160.39 62.39 330.26 112.57 Mean length of productions, minutes 4.1 .78 7.28 1.02 Length T1 SD T1 T2 SD T2 Mean length of production, tokens 160.39 62.39 330.26 112.57 Mean length of productions, minutes 4.1 .78 7.28 1.02 Notes: SD = standard deviation, T = task. Table 2: Differences in task length Length T1 SD T1 T2 SD T2 Mean length of production, tokens 160.39 62.39 330.26 112.57 Mean length of productions, minutes 4.1 .78 7.28 1.02 Length T1 SD T1 T2 SD T2 Mean length of production, tokens 160.39 62.39 330.26 112.57 Mean length of productions, minutes 4.1 .78 7.28 1.02 Notes: SD = standard deviation, T = task. T1 elicited rather short answers, which is not uncommon in language testing. The task was constructed following the example of the Swiss Project. Fluency Some aspects of the fluency scale were hardly observable, above all false starts and incomprehensible utterances (Table 3; appendix, Table A1). Table 3: Observability of fluency scale variables Scale variable CEFR level T1 mean T1 SD T2 mean T2 SD Pauses A2-B2 54.86 13.07 88.24 24.36 False starts A2 0.42 0.6 1.47 1.6 ‘B1 pauses’ B1 7.26 6.76 16.68 8.64 Incomprehensible utterances B1 0.43 0.67 3.53 2.78 ‘B2 pauses’ B2 5.37 5.87 13.16 7.65 Long pauses B2 4 2.93 3.05 3.01 Scale variable CEFR level T1 mean T1 SD T2 mean T2 SD Pauses A2-B2 54.86 13.07 88.24 24.36 False starts A2 0.42 0.6 1.47 1.6 ‘B1 pauses’ B1 7.26 6.76 16.68 8.64 Incomprehensible utterances B1 0.43 0.67 3.53 2.78 ‘B2 pauses’ B2 5.37 5.87 13.16 7.65 Long pauses B2 4 2.93 3.05 3.01 Notes: High SD values to be expected for different proficiency levels; all values per production. Table 3: Observability of fluency scale variables Scale variable CEFR level T1 mean T1 SD T2 mean T2 SD Pauses A2-B2 54.86 13.07 88.24 24.36 False starts A2 0.42 0.6 1.47 1.6 ‘B1 pauses’ B1 7.26 6.76 16.68 8.64 Incomprehensible utterances B1 0.43 0.67 3.53 2.78 ‘B2 pauses’ B2 5.37 5.87 13.16 7.65 Long pauses B2 4 2.93 3.05 3.01 Scale variable CEFR level T1 mean T1 SD T2 mean T2 SD Pauses A2-B2 54.86 13.07 88.24 24.36 False starts A2 0.42 0.6 1.47 1.6 ‘B1 pauses’ B1 7.26 6.76 16.68 8.64 Incomprehensible utterances B1 0.43 0.67 3.53 2.78 ‘B2 pauses’ B2 5.37 5.87 13.16 7.65 Long pauses B2 4 2.93 3.05 3.01 Notes: High SD values to be expected for different proficiency levels; all values per production. Table 4: Observability of vocabulary range scale variables Scale variable CEFR level T1 mean T1 SD T2 mean T2 SD Communication problems regarding communicative needs/survival needs A2 0.11 0.31 0 0 Communication problems in basic/everyday life communication B1 0.21 0.54 0.58 0.84 Circumlocutions B1, B2 0.16 0.37 1 1.41 Communication problems regarding own field/general topics B2 0.37 0.76 1.21 1.27 Pauses for lexical planning B2 4.65 4.95 12.68 7.99 Repetitions B2 5.47 3.47 13.26 9.5 Scale variable CEFR level T1 mean T1 SD T2 mean T2 SD Communication problems regarding communicative needs/survival needs A2 0.11 0.31 0 0 Communication problems in basic/everyday life communication B1 0.21 0.54 0.58 0.84 Circumlocutions B1, B2 0.16 0.37 1 1.41 Communication problems regarding own field/general topics B2 0.37 0.76 1.21 1.27 Pauses for lexical planning B2 4.65 4.95 12.68 7.99 Repetitions B2 5.47 3.47 13.26 9.5 Notes: High SD values to be expected for different proficiency levels; all values per production. Table 4: Observability of vocabulary range scale variables Scale variable CEFR level T1 mean T1 SD T2 mean T2 SD Communication problems regarding communicative needs/survival needs A2 0.11 0.31 0 0 Communication problems in basic/everyday life communication B1 0.21 0.54 0.58 0.84 Circumlocutions B1, B2 0.16 0.37 1 1.41 Communication problems regarding own field/general topics B2 0.37 0.76 1.21 1.27 Pauses for lexical planning B2 4.65 4.95 12.68 7.99 Repetitions B2 5.47 3.47 13.26 9.5 Scale variable CEFR level T1 mean T1 SD T2 mean T2 SD Communication problems regarding communicative needs/survival needs A2 0.11 0.31 0 0 Communication problems in basic/everyday life communication B1 0.21 0.54 0.58 0.84 Circumlocutions B1, B2 0.16 0.37 1 1.41 Communication problems regarding own field/general topics B2 0.37 0.76 1.21 1.27 Pauses for lexical planning B2 4.65 4.95 12.68 7.99 Repetitions B2 5.47 3.47 13.26 9.5 Notes: High SD values to be expected for different proficiency levels; all values per production. Other phenomena were not as salient as suggested by the scale: the lexical and grammatical planning and repair pauses defined typical for level B1 and B2, respectively (Table A1, appendix), represented less then a fifth of all learners’ pauses (B1: mean 16.29%, SD 10.78; B2: mean 13.1%, SD 9.99). Furthermore, assigning pausing reasons was problematic and speculative: half of all pauses (55.45%, SD 10.68) remained unexplained, and for the rest, inter-coder reliability was relatively low (C = 0.936 but κ =0.51). The B1 level assumption that the number of pauses is dependent on the length of the run proved a general tendency in the sample. Almost all learners paused considerably more in long utterances (mean of 1.85 times more pauses, SD 0.32). Lower utterance fluency in longer runs, then, might be due to general cognitive speech production processes. Vocabulary range Most problems in the vocabulary range scale (Table 4) were caused by scale variables linked to specific lexical fields and communicative functions (A2: basic communicative/simple survival needs; B1: basic vocabulary/everyday life; B2: own fields, general topics), particularly on levels A2 and B1. For methodological reasons, these descriptors were inverted in the operationalization. Hence, the resulting scale variables count failures to communicate in the lexical field under consideration (appendix, Table A1).3 All learners nearly constantly communicated successfully in all these lexical fields. Only two speakers had each one problem to make hilmself understood with regard to the (only!) A2 scale variable, while for B1, three learners in T1 (each once) and 5 learners in T2 (N = 8) did not get their messages across. These scale variables were not helpful to distinguish the learners from one another. Vocabulary control Generally, the scale variables were well observable, particularly in T2 (Table 5, appendix, Table A1). Table 5: Observability of vocabulary control scale variables Scale variable CEFR level T1 mean T1 SD T2 mean T2 SD Errors in every day needs vocabulary A2 2.16 1.83 1.26 1.19 Errors in elementary vocabulary (>4.000) B1 3 2.4 7.74 3.31 Errors in ‘rare’ vocabulary (<4.000) B1 0.58 1.12 0.84 1.12 Major errors B1 1.63 1.39 2.79 1.93 Lexical errors B2 3.6 2.2 8.6 3.2 Synforms B2 0.63 1.12 0.74 1.1 Incorrect word choices B2 2.84 2.39 6.37 2.56 Lexical errors that hinder communication B2 0.37 0.76 1.42 1.5 Scale variable CEFR level T1 mean T1 SD T2 mean T2 SD Errors in every day needs vocabulary A2 2.16 1.83 1.26 1.19 Errors in elementary vocabulary (>4.000) B1 3 2.4 7.74 3.31 Errors in ‘rare’ vocabulary (<4.000) B1 0.58 1.12 0.84 1.12 Major errors B1 1.63 1.39 2.79 1.93 Lexical errors B2 3.6 2.2 8.6 3.2 Synforms B2 0.63 1.12 0.74 1.1 Incorrect word choices B2 2.84 2.39 6.37 2.56 Lexical errors that hinder communication B2 0.37 0.76 1.42 1.5 Notes: High SD values to be expected for different proficiency levels; all values per production. Table 5: Observability of vocabulary control scale variables Scale variable CEFR level T1 mean T1 SD T2 mean T2 SD Errors in every day needs vocabulary A2 2.16 1.83 1.26 1.19 Errors in elementary vocabulary (>4.000) B1 3 2.4 7.74 3.31 Errors in ‘rare’ vocabulary (<4.000) B1 0.58 1.12 0.84 1.12 Major errors B1 1.63 1.39 2.79 1.93 Lexical errors B2 3.6 2.2 8.6 3.2 Synforms B2 0.63 1.12 0.74 1.1 Incorrect word choices B2 2.84 2.39 6.37 2.56 Lexical errors that hinder communication B2 0.37 0.76 1.42 1.5 Scale variable CEFR level T1 mean T1 SD T2 mean T2 SD Errors in every day needs vocabulary A2 2.16 1.83 1.26 1.19 Errors in elementary vocabulary (>4.000) B1 3 2.4 7.74 3.31 Errors in ‘rare’ vocabulary (<4.000) B1 0.58 1.12 0.84 1.12 Major errors B1 1.63 1.39 2.79 1.93 Lexical errors B2 3.6 2.2 8.6 3.2 Synforms B2 0.63 1.12 0.74 1.1 Incorrect word choices B2 2.84 2.39 6.37 2.56 Lexical errors that hinder communication B2 0.37 0.76 1.42 1.5 Notes: High SD values to be expected for different proficiency levels; all values per production. Unexpectedly, almost all learners made lexical errors in communicating every day needs (A2). The B1 level description contains the assumption that more ‘major errors’4 are made when speakers express more complex thoughts. Although intuitive, there is no data to support this hypothesis. Interestingly, T1 yielded a high percentage of errors to be considered ‘major’ (42.26%, SD 30.82), whereas in the more complex T2 in which a controversial issue was to be discussed (‘horoscopes’ or ‘TV talent shows’), the percentage was lower (33.32%, SD 19.8). This is true although the mean number of lexical errors per token is almost leveled in both tasks (T1: 0.3 (SD 0.3); T2: 0.3 (SD 0.2)). A possible explanation is that strategic competence might enable learners to control lexical correctness even when confronted with a more complex topic (‘islands of reliability’, Rohde 1985). The assumption that learners make more lexical errors outside the ‘comfort zone’ of elementary vocabulary (B1) is problematic because lexical items rarer than the first 4,000 words (‘rare words’) of ITA/GER (Jones and Tschirner 2006; De Mauro and Mancini 1993; De Mauro 2000, Tschirner 2010) were hardly observable. Even if the participants made more errors in rare words than was to be expected from their total occurrence (i.e. 27 out of 266 lexical errors, or 10.15%), no learner used less than 90% of basic vocabulary, and some did not use a single rare word (mean T1: 3.9% (SD 2.27); T2: 2.53 (SD 1.84%)). This scale variable thus could not help distinguish learners from one another in free production tasks. It requires a different task type to decide if the almost exclusive recurrence to basic vocabulary is due to a lack of declarative lexical knowledge or to lexical avoidance strategies. CLUSTERING LEARNER LANGUAGE WITH CEFR SCALE VARIABLES Fluency Few scale variable correlations were significant inside and across task/language(s). The ones more coherently related to each other mirror temporal aspects of fluency (B2: number of pauses/syllable and mean length of pauses, 0.775 in T1, 0.814 in T2). However, these scale variables were not related to other, more unusual scale variables like ‘B1 pauses’ (Table 6). Table 6: Cluster solutions for the fluency scale Quality Level Task Language Variance Variable Wilks’ Lambda Good A2 (without false starts) T1 ITA Length of pauses Λ = .474, p = .018 GER Complexity Λ = .174, p = .001 B1 T2 ITA ↑: Pauses dep. on utter. length Λ = .075, p = .009 ↓: Incompreh. utter. Λ = .309, p = .040 GER ↑: Pauses dep. on utter. length Λ = .092, p = .003 Contradictory A2 T2 ITA Number of pauses Λ = .189, p = .000 GER Number of pauses Λ = .248, p = .002 B2 T1 ITA – – GER – – Quality Level Task Language Variance Variable Wilks’ Lambda Good A2 (without false starts) T1 ITA Length of pauses Λ = .474, p = .018 GER Complexity Λ = .174, p = .001 B1 T2 ITA ↑: Pauses dep. on utter. length Λ = .075, p = .009 ↓: Incompreh. utter. Λ = .309, p = .040 GER ↑: Pauses dep. on utter. length Λ = .092, p = .003 Contradictory A2 T2 ITA Number of pauses Λ = .189, p = .000 GER Number of pauses Λ = .248, p = .002 B2 T1 ITA – – GER – – Notes: Wilks’ Lambda is a measure of the proportion of variance NOT explained by the variable. In case of the B1 level, the target cluster can be separated from ‘higher’ (↑) and ‘lower’ (↓) clusters. All measures normalized (per minute, syllable, token). Table 6: Cluster solutions for the fluency scale Quality Level Task Language Variance Variable Wilks’ Lambda Good A2 (without false starts) T1 ITA Length of pauses Λ = .474, p = .018 GER Complexity Λ = .174, p = .001 B1 T2 ITA ↑: Pauses dep. on utter. length Λ = .075, p = .009 ↓: Incompreh. utter. Λ = .309, p = .040 GER ↑: Pauses dep. on utter. length Λ = .092, p = .003 Contradictory A2 T2 ITA Number of pauses Λ = .189, p = .000 GER Number of pauses Λ = .248, p = .002 B2 T1 ITA – – GER – – Quality Level Task Language Variance Variable Wilks’ Lambda Good A2 (without false starts) T1 ITA Length of pauses Λ = .474, p = .018 GER Complexity Λ = .174, p = .001 B1 T2 ITA ↑: Pauses dep. on utter. length Λ = .075, p = .009 ↓: Incompreh. utter. Λ = .309, p = .040 GER ↑: Pauses dep. on utter. length Λ = .092, p = .003 Contradictory A2 T2 ITA Number of pauses Λ = .189, p = .000 GER Number of pauses Λ = .248, p = .002 B2 T1 ITA – – GER – – Notes: Wilks’ Lambda is a measure of the proportion of variance NOT explained by the variable. In case of the B1 level, the target cluster can be separated from ‘higher’ (↑) and ‘lower’ (↓) clusters. All measures normalized (per minute, syllable, token). Clusters that were completely statistically homogeneous (F < 1) and plausible, i.e. corresponding to CEFR predictions (‘good’ solutions in Table 6), could only be found for A2/T1 (if ‘false starts’ were excluded) and for B1/T2. Statistically homogeneous clusters with scale variables assuming values not foreseen in the scale (‘contradictory’ in Table 6) were found in two scenarios (A2/T2 and B2/T1). Discriminant analyses showed that single scale variables helped explain variance between ‘CEFR clusters’ and other learners, although in many cases they had only very weak explanatory power. For the other scenarios, no clustering was possible (Table 7). Learners regularly behaved differently from what level descriptions suggest. Scale variables resulted in very heterogeneous clusters. Table 7 also shows that the feasibility of the clusterings was generally more dependent on the task than on the target language. Table 7: Overview possibility of cluster solutions Notes: Good cluster solutions with F < 1 & correspondence to scale predictions (), contradictory cluster solutions with F < 1 but no correspondence to scale predictions (), no clustering possible (). Table 7: Overview possibility of cluster solutions Notes: Good cluster solutions with F < 1 & correspondence to scale predictions (), contradictory cluster solutions with F < 1 but no correspondence to scale predictions (), no clustering possible (). Many speakers’ performances fitted more than one level description (N = 9), others none (N = 3 in T1, but N = 8 in T2, Table 8). Table 8: Learners in clusters Fluency Vocabulary range Vocabulary control T1 T2 T1 T2 T1 T2 Language Learner Cluster Cluster Cluster Cluster Cluster Cluster ITA 1 B1 – B2 – A2 – 2 – A2/B1 – B1 A2 A2/B1 3 B1/B2 – B2 – A2/B1/B2 A2/B1/B2 4 A2 B1/B2 – – – A2 5 A2/B1 A2 – B2 B2 A2 6 A2 A2 – B2 – A2 7 B2 – B2 – A2/B1/B2 A2/B1/B2 8 – – B2 – A2 A2 9 B2 A2 – B1/B2 B2 A2 10 A2/B1/B2 B1 – – B2 A2/B1/B2 GER 11 A2/B1 A2 – – – – 12 A2 A2 B2 – – A2 13 B2 – B2 B2 B2 A2/B2 14 – – – – A2/B1/B2 A2/B1/B2 15 B1/B2 – – B2 B1/B2 A2/B1 16 B2 B1/B2 – – A2 A2/B1/B2 17 B2 – B2 – B2 A2 18 A2/B1 A2 B2 – A2/B1/B2 B1/B2 19 A2 A2 – – – – Fluency Vocabulary range Vocabulary control T1 T2 T1 T2 T1 T2 Language Learner Cluster Cluster Cluster Cluster Cluster Cluster ITA 1 B1 – B2 – A2 – 2 – A2/B1 – B1 A2 A2/B1 3 B1/B2 – B2 – A2/B1/B2 A2/B1/B2 4 A2 B1/B2 – – – A2 5 A2/B1 A2 – B2 B2 A2 6 A2 A2 – B2 – A2 7 B2 – B2 – A2/B1/B2 A2/B1/B2 8 – – B2 – A2 A2 9 B2 A2 – B1/B2 B2 A2 10 A2/B1/B2 B1 – – B2 A2/B1/B2 GER 11 A2/B1 A2 – – – – 12 A2 A2 B2 – – A2 13 B2 – B2 B2 B2 A2/B2 14 – – – – A2/B1/B2 A2/B1/B2 15 B1/B2 – – B2 B1/B2 A2/B1 16 B2 B1/B2 – – A2 A2/B1/B2 17 B2 – B2 – B2 A2 18 A2/B1 A2 B2 – A2/B1/B2 B1/B2 19 A2 A2 – – – – Table 8: Learners in clusters Fluency Vocabulary range Vocabulary control T1 T2 T1 T2 T1 T2 Language Learner Cluster Cluster Cluster Cluster Cluster Cluster ITA 1 B1 – B2 – A2 – 2 – A2/B1 – B1 A2 A2/B1 3 B1/B2 – B2 – A2/B1/B2 A2/B1/B2 4 A2 B1/B2 – – – A2 5 A2/B1 A2 – B2 B2 A2 6 A2 A2 – B2 – A2 7 B2 – B2 – A2/B1/B2 A2/B1/B2 8 – – B2 – A2 A2 9 B2 A2 – B1/B2 B2 A2 10 A2/B1/B2 B1 – – B2 A2/B1/B2 GER 11 A2/B1 A2 – – – – 12 A2 A2 B2 – – A2 13 B2 – B2 B2 B2 A2/B2 14 – – – – A2/B1/B2 A2/B1/B2 15 B1/B2 – – B2 B1/B2 A2/B1 16 B2 B1/B2 – – A2 A2/B1/B2 17 B2 – B2 – B2 A2 18 A2/B1 A2 B2 – A2/B1/B2 B1/B2 19 A2 A2 – – – – Fluency Vocabulary range Vocabulary control T1 T2 T1 T2 T1 T2 Language Learner Cluster Cluster Cluster Cluster Cluster Cluster ITA 1 B1 – B2 – A2 – 2 – A2/B1 – B1 A2 A2/B1 3 B1/B2 – B2 – A2/B1/B2 A2/B1/B2 4 A2 B1/B2 – – – A2 5 A2/B1 A2 – B2 B2 A2 6 A2 A2 – B2 – A2 7 B2 – B2 – A2/B1/B2 A2/B1/B2 8 – – B2 – A2 A2 9 B2 A2 – B1/B2 B2 A2 10 A2/B1/B2 B1 – – B2 A2/B1/B2 GER 11 A2/B1 A2 – – – – 12 A2 A2 B2 – – A2 13 B2 – B2 B2 B2 A2/B2 14 – – – – A2/B1/B2 A2/B1/B2 15 B1/B2 – – B2 B1/B2 A2/B1 16 B2 B1/B2 – – A2 A2/B1/B2 17 B2 – B2 – B2 A2 18 A2/B1 A2 B2 – A2/B1/B2 B1/B2 19 A2 A2 – – – – Vocabulary range For the vocabulary range scale variables, there is not one consistent correlation across languages and tasks. There are no significant correlations among the B1 variables, and only two on B2 (GER/T1: 0.724, circumlocutions and repetitions per syllable; GER/T2: 0.814, incomprehensible utterances/token and lexical planning pauses/syllable). Only on level B2 and B1 it was possible to find ‘CEFR clusters’ (Tables 7 and 9). Most learners fitted no level description (T1: N = 11, T2: N = 13, Table 8). Table 9: Cluster solutions for the vocabulary range scale Quality Level Task Language Variance Variable Wilks’ Lambda Good B1 T2 ITA ↓ Incompreh. utter., Λ = .208, p = .017 ↑ Circumlocutions Λ = .002, p = .000 B2 T1 ITA – – GER – – B2 T2 ITA Lex. planning pauses Λ = .417, p = .010 Repetitions Λ = .159, p = .002 GER – – Quality Level Task Language Variance Variable Wilks’ Lambda Good B1 T2 ITA ↓ Incompreh. utter., Λ = .208, p = .017 ↑ Circumlocutions Λ = .002, p = .000 B2 T1 ITA – – GER – – B2 T2 ITA Lex. planning pauses Λ = .417, p = .010 Repetitions Λ = .159, p = .002 GER – – Table 9: Cluster solutions for the vocabulary range scale Quality Level Task Language Variance Variable Wilks’ Lambda Good B1 T2 ITA ↓ Incompreh. utter., Λ = .208, p = .017 ↑ Circumlocutions Λ = .002, p = .000 B2 T1 ITA – – GER – – B2 T2 ITA Lex. planning pauses Λ = .417, p = .010 Repetitions Λ = .159, p = .002 GER – – Quality Level Task Language Variance Variable Wilks’ Lambda Good B1 T2 ITA ↓ Incompreh. utter., Λ = .208, p = .017 ↑ Circumlocutions Λ = .002, p = .000 B2 T1 ITA – – GER – – B2 T2 ITA Lex. planning pauses Λ = .417, p = .010 Repetitions Λ = .159, p = .002 GER – – For the B1 level descriptions, in the other scenarios (GER & ITA/T1 and GER/T2) the clusters assumed values that systematically contradicted the CEFR scale. As the A2 level description was true for all individuals in the sample, it could not be used to group speakers. Vocabulary control The vocabulary control scale variables correlate modestly. On B1, the ‘number of errors in elementary vocabulary/token’ correlates quite consistently with the ‘number of major errors/token’ (GER/T1: 0.915, T2: 0.813, ITA/T2: 0.899). On level B2, the ‘number of incorrect word choices/token’ correlates with the ‘number of lexical errors/token’ (GER/T1: 0.954, GER/T2: 0.968; ITA/T1: 0.864, ITA/T2: 0.914). Clusters are more ambiguous than for the other scales (see Table 8): Many productions correspond well to two (N = 5) or even three (N = 9) level descriptions. More than half of the productions in the sample thus could not unambiguously be matched to one level description. On B1, clusters could be found, although again the scale variables showed strongly variable patterns, and none of them significantly reduced variance between clusters. On B2, although some individuals’ behavior corresponded to the scale, single scale variables were rarely observable (Table 10) so that the plausibility of the solutions is questionable; also, statistical homogeneity was partially threatened. Table 10: Cluster solutions for the vocabulary control scale Quality Level Task Language Variance Variable Wilks’ Lambda Good A2 T1 ITA Errors everyday needs Λ = .258, p = .001 GER Errors everyday needs Λ = .173, p = .001 A2 T2 ITA – – GER Errors everyday needs Λ = .127, p = .000 B1 T1 – – – – B1 T2 – – – – Contradictory B2 T1 ITA Lexical errors Λ = .459, p = .024 GER Lexical errors incompr. utter. Λ = .195, p = .001 Λ = .096, p = .001 B2 T2 ITA Incompr. utter. Λ = .569, p = .039 GER – – Quality Level Task Language Variance Variable Wilks’ Lambda Good A2 T1 ITA Errors everyday needs Λ = .258, p = .001 GER Errors everyday needs Λ = .173, p = .001 A2 T2 ITA – – GER Errors everyday needs Λ = .127, p = .000 B1 T1 – – – – B1 T2 – – – – Contradictory B2 T1 ITA Lexical errors Λ = .459, p = .024 GER Lexical errors incompr. utter. Λ = .195, p = .001 Λ = .096, p = .001 B2 T2 ITA Incompr. utter. Λ = .569, p = .039 GER – – Table 10: Cluster solutions for the vocabulary control scale Quality Level Task Language Variance Variable Wilks’ Lambda Good A2 T1 ITA Errors everyday needs Λ = .258, p = .001 GER Errors everyday needs Λ = .173, p = .001 A2 T2 ITA – – GER Errors everyday needs Λ = .127, p = .000 B1 T1 – – – – B1 T2 – – – – Contradictory B2 T1 ITA Lexical errors Λ = .459, p = .024 GER Lexical errors incompr. utter. Λ = .195, p = .001 Λ = .096, p = .001 B2 T2 ITA Incompr. utter. Λ = .569, p = .039 GER – – Quality Level Task Language Variance Variable Wilks’ Lambda Good A2 T1 ITA Errors everyday needs Λ = .258, p = .001 GER Errors everyday needs Λ = .173, p = .001 A2 T2 ITA – – GER Errors everyday needs Λ = .127, p = .000 B1 T1 – – – – B1 T2 – – – – Contradictory B2 T1 ITA Lexical errors Λ = .459, p = .024 GER Lexical errors incompr. utter. Λ = .195, p = .001 Λ = .096, p = .001 B2 T2 ITA Incompr. utter. Λ = .569, p = .039 GER – – RELATING SCALE VARIABLES TO RESEARCH-BASED MEASURES Fluency In a first step, correlations between scale variables and research-based measures were analysed. The analysis showed mixed results, with most correlations found for the A2 level and more consistent correlations in T1 (Table A2 for list of measures). On level A2 and B2, the more functional scale variables have consistent correlations to a temporal ‘low order fluency’ (Lennon 2000).5 In addition, T-tests were used to find fluency-related measures distinguishing CEFR clusters (where feasible, except from B2 with N = 1 in the cluster) from other speakers. Table 11 shows that three standard measures of fluency (the phonation-time ratio, the mean length of runs, and the number of long pauses) significantly differentiate between speakers clustered as A2 and other speakers (p < 0.05; effect sizes are medium to high). However, for the B1 fluency cluster no homogeneous set of fluency measures supporting the cluster could be found. Table 11: Fluency-related measures and clusters Level Task Cluster quality Language Measure p-value of T-Test Effect size (η2) A2 T1 Good ITA Phonation-time ratio .04 .428 T1 Good GER Phonation-time ratio .042 .467 T2 Restricted ITA Phonation-time ratio .011 .710 T2 Restricted GER Phonation-time ratio .001 .003 T2 Restricted ITA Mean length of runs .002 .735 T2 Restricted GER Mean length of runs .003 .729 T1 Good GER Long pauses/minute .045 .459 T2 Restricted ITA Long pauses/minute .047 .408 T2 Restricted GER Long pauses/minute .045 .459 B1 T2 Restricted ITA ↓ Speech rate .018 .863 T2 Restricted ITA ↓ Phonation-time ratio .039 .792 T2 Restricted ITA ↓ Repetitions/token .017 .698 T2 Restricted ITA ↓ Constituent-internal pauses/min. .006 .612 T2 Restricted GER↑ Repetitions/token .003 .816 T2 Restricted GER↑ Hesitation phenomena/token .011 .549 T2 Restricted GER↑ Lengthenings/token .040 .292 Level Task Cluster quality Language Measure p-value of T-Test Effect size (η2) A2 T1 Good ITA Phonation-time ratio .04 .428 T1 Good GER Phonation-time ratio .042 .467 T2 Restricted ITA Phonation-time ratio .011 .710 T2 Restricted GER Phonation-time ratio .001 .003 T2 Restricted ITA Mean length of runs .002 .735 T2 Restricted GER Mean length of runs .003 .729 T1 Good GER Long pauses/minute .045 .459 T2 Restricted ITA Long pauses/minute .047 .408 T2 Restricted GER Long pauses/minute .045 .459 B1 T2 Restricted ITA ↓ Speech rate .018 .863 T2 Restricted ITA ↓ Phonation-time ratio .039 .792 T2 Restricted ITA ↓ Repetitions/token .017 .698 T2 Restricted ITA ↓ Constituent-internal pauses/min. .006 .612 T2 Restricted GER↑ Repetitions/token .003 .816 T2 Restricted GER↑ Hesitation phenomena/token .011 .549 T2 Restricted GER↑ Lengthenings/token .040 .292 Notes: T-test results; significance at 5% level. Table 11: Fluency-related measures and clusters Level Task Cluster quality Language Measure p-value of T-Test Effect size (η2) A2 T1 Good ITA Phonation-time ratio .04 .428 T1 Good GER Phonation-time ratio .042 .467 T2 Restricted ITA Phonation-time ratio .011 .710 T2 Restricted GER Phonation-time ratio .001 .003 T2 Restricted ITA Mean length of runs .002 .735 T2 Restricted GER Mean length of runs .003 .729 T1 Good GER Long pauses/minute .045 .459 T2 Restricted ITA Long pauses/minute .047 .408 T2 Restricted GER Long pauses/minute .045 .459 B1 T2 Restricted ITA ↓ Speech rate .018 .863 T2 Restricted ITA ↓ Phonation-time ratio .039 .792 T2 Restricted ITA ↓ Repetitions/token .017 .698 T2 Restricted ITA ↓ Constituent-internal pauses/min. .006 .612 T2 Restricted GER↑ Repetitions/token .003 .816 T2 Restricted GER↑ Hesitation phenomena/token .011 .549 T2 Restricted GER↑ Lengthenings/token .040 .292 Level Task Cluster quality Language Measure p-value of T-Test Effect size (η2) A2 T1 Good ITA Phonation-time ratio .04 .428 T1 Good GER Phonation-time ratio .042 .467 T2 Restricted ITA Phonation-time ratio .011 .710 T2 Restricted GER Phonation-time ratio .001 .003 T2 Restricted ITA Mean length of runs .002 .735 T2 Restricted GER Mean length of runs .003 .729 T1 Good GER Long pauses/minute .045 .459 T2 Restricted ITA Long pauses/minute .047 .408 T2 Restricted GER Long pauses/minute .045 .459 B1 T2 Restricted ITA ↓ Speech rate .018 .863 T2 Restricted ITA ↓ Phonation-time ratio .039 .792 T2 Restricted ITA ↓ Repetitions/token .017 .698 T2 Restricted ITA ↓ Constituent-internal pauses/min. .006 .612 T2 Restricted GER↑ Repetitions/token .003 .816 T2 Restricted GER↑ Hesitation phenomena/token .011 .549 T2 Restricted GER↑ Lengthenings/token .040 .292 Notes: T-test results; significance at 5% level. Vocabulary range While the scale variables related to concrete lexical fields were difficult to observe, none of the remaining scale variables was correlated to the construct of vocabulary breadth, nor in most cases to any other aspect of lexical competence (Table A3 for measures; Read 2000, 2007; Read and Chapelle 2001; Nation 2001, Wisniewski 2012). The variables regarding lexical planning pauses, circumlocutions, and repetitions (B2) seem to be related to hesitation phenomena, but the correlation analyses results are heterogeneous (languages/tasks); also, to an even lesser degree, there are sporadic links to communication strategies.6 Only single aspects in single scenarios showed significant differences between the clusters. For example, on B1, only for ITA/T2 a cluster solution was possible. These B1 speakers were distinguishable from speakers clustered higher in that they used more literal translations from their L1. B2-clustered speakers used less very common words than weaker speakers, their Guiraud’s Index was higher, and they used, surprisingly, more reformulations (GER/T2: mean 0.014, SD 0.001; weaker clusters: mean 0.006, SD 0.004). However, effect sizes are rather low (Table 12). Table 12: Vocabulary range measures distinguishing between CEFR clusters Level Task Language Measure p-value Effect size (η2) B1 T2 ITA↑ Transfer/token .049 .001 B2 T1 ITA % Most frequent 1,000 words .048 .320 B2 T1 GET Guiraud .046 .143 B2 T2 GER Reformulations/token .005 .521 Level Task Language Measure p-value Effect size (η2) B1 T2 ITA↑ Transfer/token .049 .001 B2 T1 ITA % Most frequent 1,000 words .048 .320 B2 T1 GET Guiraud .046 .143 B2 T2 GER Reformulations/token .005 .521 Notes: T-Test results; significance at 5% level. Table 12: Vocabulary range measures distinguishing between CEFR clusters Level Task Language Measure p-value Effect size (η2) B1 T2 ITA↑ Transfer/token .049 .001 B2 T1 ITA % Most frequent 1,000 words .048 .320 B2 T1 GET Guiraud .046 .143 B2 T2 GER Reformulations/token .005 .521 Level Task Language Measure p-value Effect size (η2) B1 T2 ITA↑ Transfer/token .049 .001 B2 T1 ITA % Most frequent 1,000 words .048 .320 B2 T1 GET Guiraud .046 .143 B2 T2 GER Reformulations/token .005 .521 Notes: T-Test results; significance at 5% level. Vocabulary control Some vocabulary control scale variables were correlated to construct-based measures of lexical accuracy, that is, to the number of lexical errors/token, to the Lexical Quality Indicator and, in the monologue task, to the percentage of error-free AS units (Table A3, appendix), although it was not easy to find measures correlated to all language-task scenarios (Table 13). Table 13: Correlations of measures of lexical correctness with CEFR scale variables (vocabulary control) Level Task Scale variable Measure Correlation (r) A2 T1 ‘Errors everyday needs vocabulary’ Lexical errors/token .627 T2 .460 B1 T1 ‘Errors in elementary vocabulary’ .934 T2 .982 T1 ‘Major errors’ .914 T2 .743 A2 T1 ‘Errors everyday needs vocabulary’ Lexical Quality Indicator −.543 T2 −.494 B1 T1 ‘Errors in elementary vocabulary’ −.666 T2 −.636 T1 ‘Major errors’ −.708 T2 −.647 B2 T1 ‘Lexical errors’ −.701 T2 −.644 A2 T1 ‘Errors everyday needs vocabulary’ % Error-free AS-units −.689 B1 T1 ‘Errors in elementary vocabulary’ −.679 B1 T1 ‘Major errors’ −.582 B2 T1 ‘Lexical errors’ −.668 Level Task Scale variable Measure Correlation (r) A2 T1 ‘Errors everyday needs vocabulary’ Lexical errors/token .627 T2 .460 B1 T1 ‘Errors in elementary vocabulary’ .934 T2 .982 T1 ‘Major errors’ .914 T2 .743 A2 T1 ‘Errors everyday needs vocabulary’ Lexical Quality Indicator −.543 T2 −.494 B1 T1 ‘Errors in elementary vocabulary’ −.666 T2 −.636 T1 ‘Major errors’ −.708 T2 −.647 B2 T1 ‘Lexical errors’ −.701 T2 −.644 A2 T1 ‘Errors everyday needs vocabulary’ % Error-free AS-units −.689 B1 T1 ‘Errors in elementary vocabulary’ −.679 B1 T1 ‘Major errors’ −.582 B2 T1 ‘Lexical errors’ −.668 Notes: Correlations given if consistently significant (p <.05) for both languages and tasks; values calculated across languages. All scale variables per token (see Table AI). Table 13: Correlations of measures of lexical correctness with CEFR scale variables (vocabulary control) Level Task Scale variable Measure Correlation (r) A2 T1 ‘Errors everyday needs vocabulary’ Lexical errors/token .627 T2 .460 B1 T1 ‘Errors in elementary vocabulary’ .934 T2 .982 T1 ‘Major errors’ .914 T2 .743 A2 T1 ‘Errors everyday needs vocabulary’ Lexical Quality Indicator −.543 T2 −.494 B1 T1 ‘Errors in elementary vocabulary’ −.666 T2 −.636 T1 ‘Major errors’ −.708 T2 −.647 B2 T1 ‘Lexical errors’ −.701 T2 −.644 A2 T1 ‘Errors everyday needs vocabulary’ % Error-free AS-units −.689 B1 T1 ‘Errors in elementary vocabulary’ −.679 B1 T1 ‘Major errors’ −.582 B2 T1 ‘Lexical errors’ −.668 Level Task Scale variable Measure Correlation (r) A2 T1 ‘Errors everyday needs vocabulary’ Lexical errors/token .627 T2 .460 B1 T1 ‘Errors in elementary vocabulary’ .934 T2 .982 T1 ‘Major errors’ .914 T2 .743 A2 T1 ‘Errors everyday needs vocabulary’ Lexical Quality Indicator −.543 T2 −.494 B1 T1 ‘Errors in elementary vocabulary’ −.666 T2 −.636 T1 ‘Major errors’ −.708 T2 −.647 B2 T1 ‘Lexical errors’ −.701 T2 −.644 A2 T1 ‘Errors everyday needs vocabulary’ % Error-free AS-units −.689 B1 T1 ‘Errors in elementary vocabulary’ −.679 B1 T1 ‘Major errors’ −.582 B2 T1 ‘Lexical errors’ −.668 Notes: Correlations given if consistently significant (p <.05) for both languages and tasks; values calculated across languages. All scale variables per token (see Table AI). However, in T-tests and discriminant analyses no measure of lexical accuracy could clearly separate the clusters if taken as a whole. This suggests that single scale variables might have a clearer link to the construct than the combined level description scale variables. A particular difficulty is the vertical positioning of the A2 scale variable. Learners fitting the level description had high values in several lexical (accuracy) measures (Table 14). The A2 scale variable (errors in concrete everyday needs vocabulary/token) was thus relatable to the lexical accuracy construct, but seems to indicate a higher level of ability. Table 14: Research-based measures of vocabulary control in the A2 cluster and in productions clustered higher than A2 Task/ language Measure p-value Effect size (η2) Mean A2 cluster (SD) Mean higher clusters (SD) T1/GER Lexical Quality Indicator .019 .163 69 (21.1) 39 (7.1) % Rare words .05 .420 4.04 (3.49) 3.17 (1.8) T1/ITA % Rare words .048 .235 3.8 (2.12) 1.81 (.21) T2/GER Errors in formulaic sequences/token .042 .205 .003 (.003) .006 (.001) T2/GER Lexical Quality Indicator .037 .368 104.14 (37.74) 4.5 (17.68) T2/ITA Errors in formulaic sequences/token .047 .408 .004 (.002) .01 (.00) Task/ language Measure p-value Effect size (η2) Mean A2 cluster (SD) Mean higher clusters (SD) T1/GER Lexical Quality Indicator .019 .163 69 (21.1) 39 (7.1) % Rare words .05 .420 4.04 (3.49) 3.17 (1.8) T1/ITA % Rare words .048 .235 3.8 (2.12) 1.81 (.21) T2/GER Errors in formulaic sequences/token .042 .205 .003 (.003) .006 (.001) T2/GER Lexical Quality Indicator .037 .368 104.14 (37.74) 4.5 (17.68) T2/ITA Errors in formulaic sequences/token .047 .408 .004 (.002) .01 (.00) Notes: T-test results (significance at 5% level) and effect sizes. Table 14: Research-based measures of vocabulary control in the A2 cluster and in productions clustered higher than A2 Task/ language Measure p-value Effect size (η2) Mean A2 cluster (SD) Mean higher clusters (SD) T1/GER Lexical Quality Indicator .019 .163 69 (21.1) 39 (7.1) % Rare words .05 .420 4.04 (3.49) 3.17 (1.8) T1/ITA % Rare words .048 .235 3.8 (2.12) 1.81 (.21) T2/GER Errors in formulaic sequences/token .042 .205 .003 (.003) .006 (.001) T2/GER Lexical Quality Indicator .037 .368 104.14 (37.74) 4.5 (17.68) T2/ITA Errors in formulaic sequences/token .047 .408 .004 (.002) .01 (.00) Task/ language Measure p-value Effect size (η2) Mean A2 cluster (SD) Mean higher clusters (SD) T1/GER Lexical Quality Indicator .019 .163 69 (21.1) 39 (7.1) % Rare words .05 .420 4.04 (3.49) 3.17 (1.8) T1/ITA % Rare words .048 .235 3.8 (2.12) 1.81 (.21) T2/GER Errors in formulaic sequences/token .042 .205 .003 (.003) .006 (.001) T2/GER Lexical Quality Indicator .037 .368 104.14 (37.74) 4.5 (17.68) T2/ITA Errors in formulaic sequences/token .047 .408 .004 (.002) .01 (.00) Notes: T-test results (significance at 5% level) and effect sizes. There is also the danger of construct-irrelevance. On B2, two scale variables (‘confusions/synforms’; ‘errors that hinder communication’) could not be connected to lexical accuracy. ‘Synforms’, hardly observable, were weakly related to aspects of lexis other than its accuracy, particularly sophistication (T2, correlations: % rare words 0.548; weighted lexical density 0.502; Advanced Guiraud 0.528). The B2 scale variable related to communicative success almost exclusively correlated with comprehensibility of utterances (incomprehensible utterances in basic vocabulary/everyday life: T1: 0.726, T2: 0.768; incomprehensible utterances in general topics T1: 0.989, T2: 0.923). SUMMARY AND DISCUSSION The results show that (i) the observability of operationalized descriptors (‘scale variables’) was limited. Some were not as salient as suggested. Others assumed evenly distributed values among participants, raising questions about their typicality and leading to level assignment dilemmata. Still others were hardly observed at all, or they were speculative. The number of scale variables per level is very low so that a lack of observability can put the usefulness of the whole level description at stake. It was easiest to find correlates for the fluency scale, while the vocabulary range scale was particularly problematic. Additionally, some assumptions underlying the scales (e.g. more lexical errors in more complex topics) were not confirmed. Second, the attempt at (ii) matching learner productions to (the operationalizable parts of) CEFR scales led to consistency problems. Correlations among scale variables were weak, often inconsistent, and sometimes contradictory. Scale variables were not sufficient to describe what learners did, and satisfying clusters could not be found for all level descriptions. Often, data contradicted scale predictions. Scale variable values were observable in many different patterns, so that the level descriptions did not seem to capture typical behavior. In those cases, level descriptions captured what some learners did, while ignoring many others. Many productions did not match any level description, while plenty fitted two or even three of them (Table 8). Again, the fluency scale was least problematic. Furthermore, it was only partly possible to (iii) link scale variables to research-based measures. The fluency scale variables were related to a temporal construct of ‘low order fluency’ (Lennon 2000). The vocabulary control scale contains some more functional scale variables, all related to the number of errors. Thus, analogously, a ‘low order accuracy’ construct is the least problematic—sometimes the only applicable—aspect of this scale. Both construct interpretations contradict the approach to language learning of the CEFR. It was impossible to relate the vocabulary range scale variables to a lexical construct. Other drawbacks regard construct-irrelevance and questionable vertical positioning of single scale variables. No additional information on the complexity of the three constructs is provided for CEFR users (e.g. for fluency: difference between utterance/cognitive/perceived fluency, influence of tasks, automatization and more, Segalowitz 2010)—the CEFR text thus does not help to even out scale inconsistencies. Since the CEFR claims to be valid across languages (Alderson 2007: 660; Fulcher 2008: 167), differences between target languages (Italian/German) should be minimal. While in fact some results varied with the target language, there was a stronger tendency for the task type to make a difference with regard to observability, especially for the vocabulary scales. It seems all but trivial to construct one (or more) productive task(s) that contains language material ratable with the CEFR vocabulary scales (Bachman and Palmer 2010: 351). As pointed out above, this study is not to be misunderstood as a comprehensive CEFR scale validation. The time-consuming methodology made it necessary to focus on a clearly outlined context. This explorative study takes into account language samples elicited in two oral proficiency test tasks by N = 19 learners of Italian and German. Obviously, the language produced in this context does not create a comprehensive picture of the participants’ L2. Also, the study might have yielded different results if the tasks had been designed to speficically get participants to show their best fluency, vocabulary control, and knowledge of vocabulary, or if a non-test setting had been parted from. Still, the language samples can reasonably be believed to represent what is regularly done in proficiency tests, which are a context in which it is particularly desirable for CEFR scale contents to be useful. In the given context, the results imply that it can be very problematic to use CEFR scales for the description of learner language. However, the results cannot be transferred to other contexts of language use without further empirical analyses. It is highly desirable for future validation studies to extend the range of analysed task types. Many studies overarching a wide range of languages and tasks involving longer stretches of written and spoken speech produced in a broad range of contexts are needed in order to address the empirical validity of CEFR scales more fully. This study, exploratory in nature and constrained in size, is meant as a first step into that direction. While its results must not be generalized, it might help to deliver a ‘snapshot’ (North 2014b: 23) of the empirical robustness of selected level descriptions of three CEFR scales in a specific context. Addressing the empirical validity aspects in focus here is believed to be possible only without reliance on human ratings. As a consequence, this study adopts a perspective that links CEFR scales to learner language in a very literal way, as objectively as possible. Not all aspects of the three scales could be operationalized. This procedure would seem a rather artificial thing to do in real-world contexts, but is considered unavoidable for scale validation purposes. CONCLUSION The Common European Framework of Reference has contributed considerably to improving the quality and transparency of language teaching, learning, and assessment in Europe (and beyond). Its learner-oriented, communicative approach to L2 language competence has profoundly influenced the reality of language education. Contrary to the authoring team’s original intentions, the scales are often understood as the ‘core’ of the CEFR. Fulcher et al. (2011: 232) diagnosed a ‘reification’ of the CEFR level system, intending the process of a convention that assumes the alleged character of a hard fact. It is the use that is made of the CEFR scales rather than the instruments themselves that constitutes a major problem. In many situations, the scales are misunderstood and over-interpreted. Often their suitability to describe L2 competence is overestimated. This tendency becomes dangerous if decisions about learners’ lifes are taken on the basis of CEFR scales. Thus, although the strengths and the constraints of the CEFR scales have been repeatedly defined by their authors (e.g. North 2000; North 2014), we are now in a situation where school curricula, educational standards, and language tests are readily related to CEFR levels, whilst many validity aspects of the scales have not yet been examined (North 2014: 44). This explorative study, in an attempt at non-circular validation of empirical robustness aspects, has shown that for the sample under consideration and the context of analysis, there are shortcomings in the empirical validity of the CEFR scales for fluency, vocabulary range and vocabulary control on the levels A2-B2. As mentioned, these results must not be generalized. However, they underline the need for more validation studies based on empirical learner language, for all CEFR scales and levels, for written and spoken L2, and for more languages. Such projects could greatly profit from the creation of large learner corpora (see Wisniewski, forthcoming). Fortunately, the CEFR defines itself open to revisions (CoE 2001: 8) so that it might be possible to integrate validation findings into future scale versions.Until we know more about how shaky the ground beneath the CEFR really is (Hulstijn 2007), however, great care ought to be taken when referring to CEFR levels. Conflict of interest statement. None declared. NOTES Footnotes 1 Wisniewski (2014) contains a study of the theoretical validity of the fluency and both vocabulary scales. For reasons of space, the results cannot be discussed here. 2 A recent initiative aiming at both illustration and validation of CEFR levels is MERLIN (www.merlin-platform.eu) that compiled a trilingual, freely accessible annotated learner corpus (Wisniewski et al. 2013, Abel et al. 2014). 3 The lexical/functional fields were defined in tagsets. All unclear utterances were re-assessed (basic communicative needs/everyday vocabulary/general vocabulary), independently of the determination of type frequency which was also carried through. 4 The CEFR text does not define lexical errors. The operationalization of the scale concept of “error severeness” here rests on the assumptions that errors are more severe if they 1) lead to communication problems and 2) occur in the communication of everyday/survival needs. Coders additionally classified errors according to their appearance in concrete vs abstract and specifc vs general topics. 5 ‘Functional’ scale variables on A2: length/number of pauses (e.g. phonation-time ratio −0.631/−0.795 in T1, mean length of runs T1 −0.906/−0.928, speech rate −0.680/−0.849, in T1, −0.559 in T2, number of long pauses/syllable or minute 0.905/0.766 T1, 0.833/0.731, T2). These scale variables did not significantly correlate with hesitation phenomena or strategies. B1: ‘B1 pauses’ (for lexical/grammatical planning/repair) and number of pauses/syllable (T2: 0.621), speech rate (T2: −0.707) and ‘keep goings’ (continuing speaking after a lexical problem, T1: 0.811, T2: 0.599). B2: pause-related scale variables (number of pauses/syllable and number of long pauses/syllable) consistently related to fluency-based measures (particularly in T1: phonation-time ratio −0.786/−0.782; mean length of runs −0.624/−0.906; speech rate −0.808/−0.849). 6 Repetitions (.56, T2 across languages), lexical planning pauses (.912, ITA/T1 and .659 across languages, T2), and circumlocutions (.669, GER/T1) were correlated to events in which learners kept going after a lexical formulation difficulty (indicators: occurrence of an L1/L3 expression, deadend, circumlocution, lexical error, inappropriate lexical item/generalization plus one or more pauses). Lexical planning pauses were correlated to filled pauses (.817, T1, across languages), while circumlocutions had a correlation with repairs (−0.714, ITA/T2) and false starts (.691, ITA/T2). Phonetic lengthenings were correlated with lexical planning pauses (.763, T2) and circumlocutions (.717, T1) for GER L2. REFERENCES Abel A. , Vettori C. , Wisniewski K. (eds). 2012 . Gli studenti altoatesini e la seconda lingua: indagine linguistica e psicosociale./Die Südtiroler SchülerInnen und die Zweitsprache: eine linguistische und sozialpsychologische Untersuchung , vol. 2 . Eurac . Available at http://www.eurac.edu/de/research/Publications/Pages/publicationdetails.aspx?pubId=0100156&type=Q. Abel A. , Nicolas L. , Wisniewski K. , Boyd K. A. , Hana J. . 2014 . ‘ A trilingual learner corpus illustrating European reference levels ,’ Ricognizioni. Rivista Di Lingue E Letterature E Culture Moderne 2 : 111 – 26 . Available at http://www.ojs.unito.it/index.php/ricognizioni/index. Alderson C. J. 1991 . ‘Bands and scores’ in Alderson J. C. , North B. (eds): Language Testing in the 1990s . British Council and Macmillan . pp. 71 – 86 . Alderson C. J. 2007 . ‘ The CEFR and the need for more research ,’ The Modern Language Journal 91 : 658 – 62 . Google Scholar Crossref Search ADS Alderson J. C. , Kremmel B. 2013 . ‘ Re-examining the content validation of a grammar test: The (im)possibility of distinguishing vocabulary and structural knowledge ,’ Language Testing 30 : 535 – 56 . Google Scholar Crossref Search ADS Arnaud P. J. L. 1984 . ‘The lexical richness of L2 written productions and the validity of vocabulary tests’ in Culhane T. , Klein-Braley C. , Stevenson D. K. (eds): Practice and Problems in Language Testing . Department of Language and Linguistics, University of Essex . pp. 14 – 28 . Bachman L. F. , Palmer A. . 1996 . Language Testing in Practice . Oxford University Press . Bachman L. F. , Palmer A. . 2010 . Language Testing in Practice. Developing Language Assessment and Justifying their Use in the Real World . Oxford University Press . Bartning I. , Martin M. , Vedder I. (eds). 2010 . Communicative Proficiency and Linguistic Development: Intersections between SLA and Language Testing Research. EUROSLA Monograph Series, 1. Available at http://eurosla.org/monographs/EM01/EM01tot.pdf. Chapelle C. A. 1999 . ‘Validity in language assessment ,’ Annual Review of Applied Linguistics 19 : 254 – 72 . Google Scholar Crossref Search ADS Council of Europe . 2001 . (ed.). Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Available at http://www.coe.int/t/dg4/linguistic/cadre1_en.asp. Council of Europe . 2004 . (ed.). Takala, S., F. Kaftandjieva, N. Verhelst, J. Banerjee, T. Eckes, and F. van der Schoot. Reference Supplement to the Preliminary Pilot Version of the Manual for Relating Language Examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Available at: www.coe.int/lang. Council of Europe . 2009 /2003. (ed.). North, B., N. Figueras, S. Takala, P. Van Avermaet and N. Verhelst. Relating Language Examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Manual. Preliminary Pilot Version. Available at: www.coe.int/lang. Cucchiarini C. , Strik H. , Boyes L. . 2000 . ‘‘ Quantitative assessment of second language learners’ fluency by means of automatic speech recognition technology ,’’ Journal of the Acoustical Society of America 107 : 989 – 99 . Google Scholar Crossref Search ADS PubMed De Jong N. H. , Groenhout R. , Schoonen R. , Hulstijn J. H. . 2013 . ‘ Second language fluency: speaking style or proficiency? Correcting measures of second language fluency for first language behavior ,’ Applied Psycholinguistics 36 : 223 – 43 . Google Scholar Crossref Search ADS De Mauro T. (ed.). 2000 . Grande dizionario italiano dell’uso . UTET . De Mauro T. , Mancini F. . 1993 . Lessico di frequenza dell’italiano parlato . Etaslibri . Dörnyei Z. , Scott M. L. . 1997 . ‘ Communication strategies in a second language: definitions and taxonomies. review article ,’ Language Learning 47 : 173 – 210 . Google Scholar Crossref Search ADS Eckes T. 2008 . ‘ Rater types in writing performance assessments: a classification approach to rater variability ,’ Language Testing 25 : 155 – 85 . Google Scholar Crossref Search ADS Foster P. , Tonkyn A. , Wigglesworth G. . 2000 . ‘ Measuring spoken language: a unit for all reasons ,’ Applied Linguistics 21 : 354 – 75 . Google Scholar Crossref Search ADS Fulcher G. 1996 . ‘ Does thick description lead to smart tests? A data-based approach to rating scale construction ,’ Language Testing 13 : 208 – 38 . Google Scholar Crossref Search ADS Fulcher G. 2003 . Testing Second Language Speaking . Longman and Pearson Education . Fulcher G. 2004 . ‘ Deluded by artifices? The common European framework and harmonization ,’ Language Assessment Quarterly an International Journal 1 : 253 – 66 . Google Scholar Crossref Search ADS Fulcher G. 2008 . ‘Criteria for evaluating language quality’ in: Shohamy E. , Hornberger N. H. (eds): Language Testing and Assessment: Encyclopedia of Language and Education , Vol. 7 . Springer . pp. 157 – 76 . Fulcher G. , Davidson F. , Kemp J. . 2011 . ‘ Effective rating scale development for speaking tests: performance decision trees ,’ Language Testing 28 : 5 – 29 . Google Scholar Crossref Search ADS Guiraud P. 1954 . Les caractères statistiques du vocabulaire . Presse Universitaires de France . Harrison J. , Barker F. . 2015 . English Profile in Practice . Cambridge University Press . Harsch C. 2005 . Der Gemeinsame Europäische Referenzrahmen für Sprachen: Leistung und Grenzen. Die Bedeutung des Referenzrahmens im Kontext der Beurteilung von Sprachvermögen am Beispiel des semikreativen Schreibens im DESI-Projekt. Available at http://opus.bibliothek.uni-augsburg.de/opus4/ frontdoor/index/index/docId/297. Hawkins J. A. , Filipovíc L. . 2012 . Criterial Features in L2 English: Specifying the Reference Levels of the Common European Framework . Cambridge University Press . Hilton H. 2008 . ‘ The link between vocabulary knowledge and L2 fluency ,’ Language Learning Journal 36 : 153 – 66 . Google Scholar Crossref Search ADS Hudson Th. 2005 . ‘ Current trends in assessment scales and criterion-referenced language assessment ,’ Annual Review of Applied Linguistics 25 : 205 – 27 . Google Scholar Crossref Search ADS Hulstijn J. H. 2007 . ‘ The shaky ground beneath the CEFR: quantitative and qualitative dimensions of language proficiency ,’ The Modern Language Journal 91 : 663 – 7 . Google Scholar Crossref Search ADS Hulstijn J. H. , Alderson C. , Schoonen R. . 2010 . ‘Developmental stages in second-language acquisition and levels of second-language proficiency: are there links between them?’ in Bartning I. , Martin M. , Vedder I. (eds): Communicative Proficiency and Linguistic Development . pp. 5 – 10 . Jones R. , Tschirner E. . 2006 . A Frequency Dictionary of German: Core Vocabulary for Learners . Routledge . Kane M. T. 2001 . ‘ Current concerns in validity theory ,’ Journal of Educational Measurement 38 : 319 – 42 . Google Scholar Crossref Search ADS Knoch U. 2007 . ‘Do empirically developed rating scales function differently to conventional rating scales for academic writing?,’ in J. S. Johnson (ed). Spaan Fellow Working Papers in Second or Foreign Language Assessment. University of Michigan, 5: pp. 1–36. Knoch U. 2009 . ‘ Diagnostic assessment of writing: a comparison of two rating scales ,’ Language Testing 26 : 275 – 304 . Google Scholar Crossref Search ADS Knoch U. 2011 . ‘ Rating scales for diagnostic assessment of writing: where should the criteria come from? ,’ Assessing Writing 16 : 81 – 96 . Google Scholar Crossref Search ADS Kormos J. 2006 . Speech Production and Second Language Acquisition . Erlbaum . Kuiken F. , Vedder I. (eds). 2014 . ‘ Special issue on assessing oral and written L2 performance: raters’ decisions, rating procedures and rating scales ,’ Language Testing 31 : 279 – 84 . Google Scholar Crossref Search ADS Lennon P. 1991 . ‘ Error. Some problems of definition, identification, and distinction ,’ Applied Linguistics 12 : 180 – 95 . Google Scholar Crossref Search ADS Lennon P. 2000 . ‘The lexical element in spoken second language fluency’ in Riggenbach H. , (ed.): Perspectives on Fluency . Michigan University Press , pp. 25 – 42 . Little D. 2007 . ‘ The common European framework of reference for languages: perspectives on the making of supranational language education policy ,’ The Modern Language Journal 91 : 645 – 55 . Google Scholar Crossref Search ADS Malvern D. , Richards B. , Chipere N. , Durán P. . 2008 . Lexical Diversity and Language Development. Quantification and Assessment . Palgrave Macmillan . McNamara T. , Hill K. , May L. . 2002 . ‘ Discourse and assessment ,’ Annual Review of Applied Linguistics 22 : 221 – 42 . Google Scholar Crossref Search ADS Messick S. 1989 . ‘Validity’ in Linn R. L. , (ed.): Educational Measurement . Macmillan and American Council on Education , pp. 13 – 103 . Nation P. 2001 . Learning Vocabulary in another Language . Cambridge University Press . North B. 1997 . ‘ Perspectives on language proficiency and aspects of competence ,’ Language Teaching 30 : 93 – 100 . Google Scholar Crossref Search ADS North B. 2000 . The Development of a Common Framework Scale of Language Proficiency . Peter Lang . North B. 2007 . ‘ The CEFR illustrative descriptors ,’ The Modern Language Journal 91 : 656 – 9 . Google Scholar Crossref Search ADS North B. 2014a . ‘ Putting the common European framework of reference to good use ,’ Language Teaching 47 : 228 – 49 . Google Scholar Crossref Search ADS North B. 2014b . The CEFR in Practice . Cambridge University Press . O’Loughlin K. 1995 . ‘ Lexical density in candidate output on direct and semi-direct versions of an oral proficiency test ,’ Language Testing 12 : 217 – 37 . Google Scholar Crossref Search ADS Read J. 2000 . Assessing Vocabulary . Cambridge University Press . Read J. 2007 . ‘ Second language vocabulary assessment: current practice and new directions ,’ International Journal of English Studies 7 : 105 – 25 . Read J. , Chapelle C. . 2001 . ‘ A framework for second language vocabulary assessment ,’ Language Testing 18 : 1 – 32 . Google Scholar Crossref Search ADS Rohde L. 1985 . ‘Compensatory fluency: a study of spoken english produced by four Danish learners’ in Glahn E. , Holmen A. (eds): Learner Discourse . University of Copenhagen , pp. 43 – 69 . Schmid H. 1994 . ‘Probabilistic part-of-speech tagging using decision trees’ in Jones D. (ed.): Proceedings of the International Conference on New Methods in Language Processing . University of Manchester , pp. 44 – 9 . Schneider G. , North B. 2000 . Fremdsprachen können—was heißt das? Skalen zur Beschreibung, Beurteilung und Selbsteinschätzung der fremdsprachlichen Kommunikationsfähigkeit . Rüegger . Segalowitz N. 2010 . Cognitive Bases of Second Language Fluency . Routledge . Tschirner E. 2005 . ‘ Das ACTFL OPI und der Europäische Referenzrahmen ,’ Babylonia 2 : 50 – 5 . Tschirner E. 2010 . Grund- und Aufbauwortschatz Italienisch nach Themen . Cornelsen . Wisniewski K. (forthcoming). ‘Empirical learner language and the levels of the common European framework of reference,’ Language Learning. Wisniewski K. 2010 . ‘ Bewertervariabilität im Umgang mit GeRS-Skalen. Ein- und Aussichten aus einem Sprachtestprojekt ,’ Deutsch Als Fremdsprache 3 : 143 – 50 . Wisniewski K. 2012 . Lexikalische Kompetenz in der Fremdsprache testen: Ein Modellierungsversuch , In Abel A. , Vettori C. , Wisniewski K. (eds). pp. 24 – 49 . Wisniewski K. 2013 . ‘The empirical validity of the CEFR fluency scale: the A2 level description’ in Galaczi E. D. , Weir C. (eds): Exploring Language Frameworks: Proceedings of the ALTE Krakow Conference, July 2011 . Cambridge University Press , 253 – 72 . Wisniewski K. 2014 . Die Validität der Skalen des Gemeinsamen europäischen Referenzrahmens für Sprachen. Eine empirische Untersuchung der Flüssigkeits- und Wortschatzskalen des GeRS am Beispiel des Italienischen und des Deutschen . Peter Lang . Wisniewski K. , Schöne K. , Nicolas L. , Vettori C. , Boyd A. , Meurers D. , Hana J. , Abel A. . 2013 . ‘MERLIN: an online trilingual learner corpus empirically grounding the CEFR reference levels in authentic data’ in Conference Proceedings 2013: ICT for Language Learning. Libreriauniversitaria, pp. 12 – 16 . Available at http://conference.pixel-online.net/ICT4LL2013/conferenceproceedings.php Wray A. 2002 . Formulaic Language and the Lexicon . Cambridge University Press . Appendix Table A1: Level descriptions and operationalized scale variables Notes: Descriptors and scale variables derived from the level descriptions (A2-B2) of the three CEFR scales. Table A1: Level descriptions and operationalized scale variables Notes: Descriptors and scale variables derived from the level descriptions (A2-B2) of the three CEFR scales. Table A2: Fluency Measures Fluency aspect Measure Reference (selection) Temporal phenomena Speech rate (syllables/minute) De Jong et al. 2013 Articulation rate (syllables/minute pauses excluded) Cucchiarini et al. (2000) (Exact) phonation-time ratio Segalowitz 2010 Mean length of run (utterances between pauses >250ms) " Pauses/minute and syllable " Mean length of pauses >250 ms Kormos 2006, Segalowitz 2010 Number and length of pauses in different positions (constituent-internal vs clause boundaries) Hilton 2008 Number, mean length, % of all pauses of differently caused pauses CEFR Strategies Message abandonment/token Dörnyei and Scott 1997 for all Circumlocution/token Approximation/token Use of allpurpose words/token Word coinage/token Literal translation/token Foreignizing/token Code-switching/token Verbal strategy marker/token Direct appeal for help/token Use of fillers/token Hesitation phenomena Repetitions/token Kormos 2006 Reformulations/token Fulcher 1996 Self-repairs/token Kormos 2006 False starts/token Rohde 1985 Filled pauses/token Kormos 2006, Segalowitz 2010 Phonetical lengthenings/token Rohde 1985 Keep going/token (keep speaking after a lexical problem) CEFR Fluency aspect Measure Reference (selection) Temporal phenomena Speech rate (syllables/minute) De Jong et al. 2013 Articulation rate (syllables/minute pauses excluded) Cucchiarini et al. (2000) (Exact) phonation-time ratio Segalowitz 2010 Mean length of run (utterances between pauses >250ms) " Pauses/minute and syllable " Mean length of pauses >250 ms Kormos 2006, Segalowitz 2010 Number and length of pauses in different positions (constituent-internal vs clause boundaries) Hilton 2008 Number, mean length, % of all pauses of differently caused pauses CEFR Strategies Message abandonment/token Dörnyei and Scott 1997 for all Circumlocution/token Approximation/token Use of allpurpose words/token Word coinage/token Literal translation/token Foreignizing/token Code-switching/token Verbal strategy marker/token Direct appeal for help/token Use of fillers/token Hesitation phenomena Repetitions/token Kormos 2006 Reformulations/token Fulcher 1996 Self-repairs/token Kormos 2006 False starts/token Rohde 1985 Filled pauses/token Kormos 2006, Segalowitz 2010 Phonetical lengthenings/token Rohde 1985 Keep going/token (keep speaking after a lexical problem) CEFR Table A2: Fluency Measures Fluency aspect Measure Reference (selection) Temporal phenomena Speech rate (syllables/minute) De Jong et al. 2013 Articulation rate (syllables/minute pauses excluded) Cucchiarini et al. (2000) (Exact) phonation-time ratio Segalowitz 2010 Mean length of run (utterances between pauses >250ms) " Pauses/minute and syllable " Mean length of pauses >250 ms Kormos 2006, Segalowitz 2010 Number and length of pauses in different positions (constituent-internal vs clause boundaries) Hilton 2008 Number, mean length, % of all pauses of differently caused pauses CEFR Strategies Message abandonment/token Dörnyei and Scott 1997 for all Circumlocution/token Approximation/token Use of allpurpose words/token Word coinage/token Literal translation/token Foreignizing/token Code-switching/token Verbal strategy marker/token Direct appeal for help/token Use of fillers/token Hesitation phenomena Repetitions/token Kormos 2006 Reformulations/token Fulcher 1996 Self-repairs/token Kormos 2006 False starts/token Rohde 1985 Filled pauses/token Kormos 2006, Segalowitz 2010 Phonetical lengthenings/token Rohde 1985 Keep going/token (keep speaking after a lexical problem) CEFR Fluency aspect Measure Reference (selection) Temporal phenomena Speech rate (syllables/minute) De Jong et al. 2013 Articulation rate (syllables/minute pauses excluded) Cucchiarini et al. (2000) (Exact) phonation-time ratio Segalowitz 2010 Mean length of run (utterances between pauses >250ms) " Pauses/minute and syllable " Mean length of pauses >250 ms Kormos 2006, Segalowitz 2010 Number and length of pauses in different positions (constituent-internal vs clause boundaries) Hilton 2008 Number, mean length, % of all pauses of differently caused pauses CEFR Strategies Message abandonment/token Dörnyei and Scott 1997 for all Circumlocution/token Approximation/token Use of allpurpose words/token Word coinage/token Literal translation/token Foreignizing/token Code-switching/token Verbal strategy marker/token Direct appeal for help/token Use of fillers/token Hesitation phenomena Repetitions/token Kormos 2006 Reformulations/token Fulcher 1996 Self-repairs/token Kormos 2006 False starts/token Rohde 1985 Filled pauses/token Kormos 2006, Segalowitz 2010 Phonetical lengthenings/token Rohde 1985 Keep going/token (keep speaking after a lexical problem) CEFR Table A3: Lexical Measures Indicator Reference (selection) Guiraud‘s Index Guiraud 1954 Advanced Guiraud’s Index Malvern et al. 2008 Lexical density indicator O’Loughlin 1995 Weighted lexical density As above % First k ITA/GER Jones and Tschirner 2006; De Mauro and Mancini 1993; De Mauro 2000, Tschirner 2010 % Rare words (<4,000) As above % Basic vocabulary (>4,000) As above Lexical error annotation Form-related lexical errors Meaning-related lexical Target hypotheses Domain & extent of errors Target language modification L3/L1 influence Systematicity Lennon 1991; Nation 2001, Wisniewski 2012, 2014; Wray 2002 Error ratio 1 (lexical error types/token) Error ratio 2 (lexical error tokens/token) % error-free AS units Foster et al. 2000 Lexical Quality Indicator Arnaud 1984 Lexical errors in 1,000/4,000 most frequent words and rare words (<4,000)/token Indicator Reference (selection) Guiraud‘s Index Guiraud 1954 Advanced Guiraud’s Index Malvern et al. 2008 Lexical density indicator O’Loughlin 1995 Weighted lexical density As above % First k ITA/GER Jones and Tschirner 2006; De Mauro and Mancini 1993; De Mauro 2000, Tschirner 2010 % Rare words (<4,000) As above % Basic vocabulary (>4,000) As above Lexical error annotation Form-related lexical errors Meaning-related lexical Target hypotheses Domain & extent of errors Target language modification L3/L1 influence Systematicity Lennon 1991; Nation 2001, Wisniewski 2012, 2014; Wray 2002 Error ratio 1 (lexical error types/token) Error ratio 2 (lexical error tokens/token) % error-free AS units Foster et al. 2000 Lexical Quality Indicator Arnaud 1984 Lexical errors in 1,000/4,000 most frequent words and rare words (<4,000)/token Table A3: Lexical Measures Indicator Reference (selection) Guiraud‘s Index Guiraud 1954 Advanced Guiraud’s Index Malvern et al. 2008 Lexical density indicator O’Loughlin 1995 Weighted lexical density As above % First k ITA/GER Jones and Tschirner 2006; De Mauro and Mancini 1993; De Mauro 2000, Tschirner 2010 % Rare words (<4,000) As above % Basic vocabulary (>4,000) As above Lexical error annotation Form-related lexical errors Meaning-related lexical Target hypotheses Domain & extent of errors Target language modification L3/L1 influence Systematicity Lennon 1991; Nation 2001, Wisniewski 2012, 2014; Wray 2002 Error ratio 1 (lexical error types/token) Error ratio 2 (lexical error tokens/token) % error-free AS units Foster et al. 2000 Lexical Quality Indicator Arnaud 1984 Lexical errors in 1,000/4,000 most frequent words and rare words (<4,000)/token Indicator Reference (selection) Guiraud‘s Index Guiraud 1954 Advanced Guiraud’s Index Malvern et al. 2008 Lexical density indicator O’Loughlin 1995 Weighted lexical density As above % First k ITA/GER Jones and Tschirner 2006; De Mauro and Mancini 1993; De Mauro 2000, Tschirner 2010 % Rare words (<4,000) As above % Basic vocabulary (>4,000) As above Lexical error annotation Form-related lexical errors Meaning-related lexical Target hypotheses Domain & extent of errors Target language modification L3/L1 influence Systematicity Lennon 1991; Nation 2001, Wisniewski 2012, 2014; Wray 2002 Error ratio 1 (lexical error types/token) Error ratio 2 (lexical error tokens/token) % error-free AS units Foster et al. 2000 Lexical Quality Indicator Arnaud 1984 Lexical errors in 1,000/4,000 most frequent words and rare words (<4,000)/token © Oxford University Press 2017 This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Applied Linguistics Oxford University Press http://www.deepdyve.com/lp/oxford-university-press/the-empirical-validity-of-the-common-european-framework-of-reference-fq1CLVDeaJ

Loading next page...

References (54)

H. Riggenbach (2000)
Perspectives on Fluency
G. Fulcher (1996)
Does thick description lead to smart tests? A data-based approach to rating scale construction
Language Testing, 13
J. Vizmuller-Zocco, T. Mauro, Federico Mancini, M. Vedovelli, M. Voghera (1994)
Lessico di frequenza dell'italiano parlato
, 71
A. Wray (2002)
Formulaic Language and the Lexicon: List of Figures and Tables
David Little (2007)
The Common European Framework of Reference for Languages: Perspectives on the Making of Supranational Language Education Policy
The Modern Language Journal, 91
Erwin Tschirner, Randall Jones (2015)
A Frequency Dictionary of German: Core Vocabulary for Learners
G. Fulcher (2003)
Testing Second Language Speaking
B. North (2000)
The development of a common framework scale of language proficiency
C. Chapelle (1999)
VALIDITY IN LANGUAGE ASSESSMENT
Annual Review of Applied Linguistics, 19
J. Read, C. Chapelle (2001)
A framework for second language vocabulary assessment
Language Testing, 18
C. Cucchiarini, H. Strik, L. Boves (2000)
Quantitative assessment of second language learners' fluency by means of automatic speech recognition technology.
The Journal of the Acoustical Society of America, 107 2
J. Alderson (2007)
The CEFR and the Need for More Research
The Modern Language Journal, 91
(1991)
citation_publisher=British Council and Macmillan, ; Language Testing in the 1990s
U. Knoch (2009)
Diagnostic assessment of writing: A comparison of two rating scales
Language Testing, 26
(2010)
citation_publisher=Oxford University Press, ; Language Testing in Practice. Developing Language Assessment and Justifying their Use in the Real World
I. Nation (2001)
Learning Vocabulary in Another Language: Frontmatter
T. Hudson (2005)
TRENDS IN ASSESSMENT SCALES AND CRITERION-REFERENCED LANGUAGE ASSESSMENT
Annual Review of Applied Linguistics, 25
T. McNamara, Kathryn Hill, Lyn May (2002)
Discourse and Assessment.
, 22
N. Segalowitz (2010)
Cognitive Bases of Second Language Fluency
U. Knoch (2011)
Rating scales for diagnostic assessment of writing: What should they look like and where should the criteria come from?
Assessing Writing, 16
Andreas Abel (2014)
A Trilingual Learner Corpus illustrating European Reference Levels
, 1
G. Fulcher, F. Davidson, Jenny Kemp (2011)
Effective rating scale development for speaking tests: Performance decision trees
Language Testing, 28
Heather Hilton (2008)
The link between vocabulary knowledge and spoken L2 fluency
The Language Learning Journal, 36
J. Hawkins, Luna Filipović (2012)
Criterial Features in L2 English: Specifying the Reference Levels of the Common European Framework
Evelina Dimitrova-Galaczi, C. Weir (2013)
Exploring language frameworks : proceedings of the ALTE Kraków Conference, July 2011
Katrin Wisniewski (2014)
Die Validität der Skalen des Gemeinsamen europäischen Referenzrahmens für Sprachen : eine empirische Untersuchung der Flüssigkeits- und Wortschatzskalen des GeRS am Beispiel des Italienischen und des Deutschen
Judit Kormos (2006)
Speech production and second language acquisition
B. North (2014)
The CEFR in Practice
J. Hulstijn (2007)
The shaky ground beneath the CEFR: Quantitative and qualitative dimensions of language proficiency
The Modern Language Journal, 91
K. O’Loughlin (1995)
Lexical density in candidate output on direct and semi-direct versions of an oral proficiency test
Language Testing, 12
(1984)
citation_publisher=Department of Language and Linguistics, University of Essex, ; Practice and Problems in Language Testing
(1996)
citation_publisher=Oxford University Press, ; Language Testing in Practice
B. North (2011)
Putting the Common European Framework of Reference to good use
Language Teaching, 47
G. Fulcher (2004)
Deluded by Artifices? The Common European Framework and Harmonization
Language Assessment Quarterly, 1
Z. Dörnyei, Mary Call (1997)
Review Article Communication Strategies in a Second Language: Definitions and Taxonomies
Lyle Bachman, A. Palmer (1998)
语言测试实践 = Language testing in practice
J. Read (2007)
Second language vocabulary assessment: current practices and new directions
International journal of english studies, Vol, 7
J. Harrison, F. Barker (2015)
English Profile in Practice
B. North (2007)
The CEFR Illustrative Descriptor Scales
The Modern Language Journal, 91
B. North (1997)
Perspectives on language proficiency and aspects of competence
Language Teaching, 30
T. Eckes (2008)
Rater types in writing performance assessments: A classification approach to rater variability
Language Testing, 25
P. Lennon (1991)
Error: Some Problems of Definition, Identification, and Distinction
Applied Linguistics, 12
(2012)
citation_publisher=Eurac, ; Gli studenti altoatesini e la seconda lingua: indagine linguistica e psicosociale./Die Südtiroler SchülerInnen und die Zweitsprache: eine linguistische und sozialpsychologische Untersuchung
, 2
F. Kuiken, I. Vedder (2014)
Raters’ decisions, rating procedures and rating scales
Language Testing, 31
D. Malvern, B. Richards, N. Chipere, P. Durán (2004)
Lexical Diversity and Language Development: Quantification and Assessment
Nivja Jong, Rachel Groenhout, R. Schoonen, J. Hulstijn (2015)
Second language fluency: speaking style or proficiency? Correcting measures of second language fluency for first language behavior
Applied Psycholinguistics, 36
J. Alderson, B. Kremmel (2013)
Re-examining the content validation of a grammar test: The (im)possibility of distinguishing vocabulary and structural knowledge
Language Testing, 30
(2016)
L'épreuve cantonale de référence d'allemand : une nouveauté en Pays de Vaud
J. Read (2000)
Assessing Vocabulary: Acknowledgements
Pauline Foster, Alan Tonkyn, Gillian Wigglesworth (2000)
Measuring spoken language: a unit for all reasons
Applied Linguistics, 21
T. McNamara, Kathryn Hill, Lyn May (2002)
12. DISCOURSE AND ASSESSMENT
Annual Review of Applied Linguistics, 22
T. Mauro (2000)
Grande dizionario italiano dell'uso
, 77
R. Linn (1989)
Educational measurement, 3rd ed.
M. Kane (2001)
Current Concerns in Validity Theory
Journal of Educational Measurement, 38

Publisher: Oxford University Press
Copyright: © Oxford University Press 2017
ISSN: 0142-6001
eISSN: 1477-450X
DOI: 10.1093/applin/amw057
Publisher site: See Article on Publisher Site

Abstract

Abstract In spite of the widespread use of Common European Framework of Reference for language learning, teaching, and assessment (CEFR) scales, there is an overwhelming lack of evidence regarding their power to describe empirical learner language (Fulcher 2004; Hulstijn 2007). This article presents results of a study that focused on the empirical robustness (i.e. the power of level descriptions to capture what learners actually do in a language test) of the CEFR vocabulary & fluency scales (A2-B2). Data stem from an Italian & German oral proficiency test (Abel et al. 2012). Results show that the empirical robustness was flawed: Some scale contents were hardly observable or so evenly distributed that they could not distinguish between learners. Contradictory/weak correlations among scale features and heterogeneous cluster solutions suggest that scales did not consistently capture typical learner behaviour. Often, learner language could not be objectively described by any level description. Also, it was only partially possible to link scale contents to research-based measures of fluency and vocabulary. Given the importance of CEFR levels in many high-stakes contexts, the results suggest the need of a large empirical validation project. INTRODUCTION The Common European Framework of Reference for language learning, teaching, and assessment (CEFR, CoE 2001) has become the most important yardstick for the development of language tests, curricula, educational standards, and textbooks in Europe. There is virtually no (European) high-stakes test not related to the framework. Many crucial decisions about people’s lives are based on CEFR level descriptions. Thus, even if the scale system is only one component of the CEFR, which as a whole takes on a much broader educational and political perspective, it is of fundamental importance to deliver evidence for the possibility to arrive at fair descriptions of learner language with the help of the scales. While there are many open questions related to the CEFR scales (e.g. their relationship to second language acquisition and their theoretical foundations, including the concept of proficiency they are built upon) that are not discussed here, this article focuses on some aspects of empirical validity, roughly understood here as the usefulness of CEFR descriptors to describe authentic learner language. Until now, there is hardly any evidence for the empirical validity of the CEFR scales (Fulcher 2004; Alderson 2007; Hulstijn 2007; Little 2007; Hulstijn et al. 2010, Wisniewski 2013, 2014). As this insecurity is a consequence of the methodology chosen to calibrate the CEFR scales, the article will first briefly sum up the main steps of the scaling process. In that complex procedure, which lay great emphasis on the perceptions of human raters, descriptor contents were not matched onto empirical learner language so that it is unclear if and to what degree CEFR descriptors capture what (different) learners do (in differing contexts, tasks, and target languages). This article aims at tackling the problem of empirical validity of the A2-B2 level descriptions of three selected CEFR scales, that is, vocabulary range and control and spoken fluency, in an exemplary context of an oral proficiency test for German and Italian as L2. Its aim is to examine whether in this context, the level descriptions are empirically robust. Level descriptions are tentatively considered empirically robust in a given context if the concepts they contain are reliably observable in learner texts, and if they allow to group L2 productions coherently and unambiguously into CEFR levels. It is considered a further argument in favour of scale validity if the descriptor contents can be empirically related to research-based measures of the underlying constructs of fluency and vocabulary range and control. In order to establish a direct relationship between learner texts and the CEFR scales and to avoid circular argumentation, all analyses are carried out without taking into consideration human ratings. CEFR SCALES AND EMPIRICAL VALIDITY ISSUES A central point of uncertainty about the empirical validity of the CEFR scales regards their questionable adequateness for the description of learner language. This problem results from the scale calibration methodology employed in the so-called ‘Swiss Project’ (1993–1996; North 2000, 2014b: 14–21; Schneider and North 2000; Council of Europe (CoE) 2001: 217–25). The procedure can be roughly summed up as follows: In a first, intuitive project stage, approximately 2,000 descriptors of L2 competence were pooled. These ‘can-do statements’ were collected from diverse tests of English as an L2, and extracted from different types of scales. The qualitative, second project stage involved 300 Swiss teachers who in 32 workshops sorted the descriptors according to perceived categories of L2 proficiency. The descriptors they best agreed upon were then used used to assess (oral) language productions. In a third, quantitative step, Multi-Facet Rasch Analysis (CoE 2004: Section H) was applied to calibrate descriptors with sufficient statistical qualities (reliability) on one common logit scale. That scale was subdivided into the now well-known six common reference levels of the CEFR (A1-A2-B1-B2-C1-C2, Schneider and North 2000:153); an exemplary scale is presented in Table 1 Table 1: Vocabulary range scale of the CEFR (CoE 2001: 112) C2 Has a good command of a very broad lexical repertoire including idiomatic expressions and colloquialisms; shows awareness of connotative levels of meaning. C1 Has a good command of a broad lexical repertoire allowing gaps to be readily overcome with circumlocutions; little obvious searching for expressions or avoidance strategies. Good command of idiomatic expressions and colloquialisms. B2 Has a good range of vocabulary for matters connected to his/her field and most general topics. Can vary formulations to avoid frequent repetitions, but lexical gaps can still cause hesitation and circumlocution. B1 Has a sufficient vocabulary to express him/herself with some circumlocutions on most topics pertinent to his/her everyday life such as family, hobbies and interests, work, travel, and current events. A2+ Has sufficient vocabulary to conduct routine, everyday transactions involving familiar situations and topics. A2 Has a sufficient vocabulary for the expression of basic communicative needs. Has a sufficient vocabulary for coping with simple survival needs. A1 Has a basic vocabulary repertoire of isolated words and phrases related to particular concrete situations. C2 Has a good command of a very broad lexical repertoire including idiomatic expressions and colloquialisms; shows awareness of connotative levels of meaning. C1 Has a good command of a broad lexical repertoire allowing gaps to be readily overcome with circumlocutions; little obvious searching for expressions or avoidance strategies. Good command of idiomatic expressions and colloquialisms. B2 Has a good range of vocabulary for matters connected to his/her field and most general topics. Can vary formulations to avoid frequent repetitions, but lexical gaps can still cause hesitation and circumlocution. B1 Has a sufficient vocabulary to express him/herself with some circumlocutions on most topics pertinent to his/her everyday life such as family, hobbies and interests, work, travel, and current events. A2+ Has sufficient vocabulary to conduct routine, everyday transactions involving familiar situations and topics. A2 Has a sufficient vocabulary for the expression of basic communicative needs. Has a sufficient vocabulary for coping with simple survival needs. A1 Has a basic vocabulary repertoire of isolated words and phrases related to particular concrete situations. Table 1: Vocabulary range scale of the CEFR (CoE 2001: 112) C2 Has a good command of a very broad lexical repertoire including idiomatic expressions and colloquialisms; shows awareness of connotative levels of meaning. C1 Has a good command of a broad lexical repertoire allowing gaps to be readily overcome with circumlocutions; little obvious searching for expressions or avoidance strategies. Good command of idiomatic expressions and colloquialisms. B2 Has a good range of vocabulary for matters connected to his/her field and most general topics. Can vary formulations to avoid frequent repetitions, but lexical gaps can still cause hesitation and circumlocution. B1 Has a sufficient vocabulary to express him/herself with some circumlocutions on most topics pertinent to his/her everyday life such as family, hobbies and interests, work, travel, and current events. A2+ Has sufficient vocabulary to conduct routine, everyday transactions involving familiar situations and topics. A2 Has a sufficient vocabulary for the expression of basic communicative needs. Has a sufficient vocabulary for coping with simple survival needs. A1 Has a basic vocabulary repertoire of isolated words and phrases related to particular concrete situations. C2 Has a good command of a very broad lexical repertoire including idiomatic expressions and colloquialisms; shows awareness of connotative levels of meaning. C1 Has a good command of a broad lexical repertoire allowing gaps to be readily overcome with circumlocutions; little obvious searching for expressions or avoidance strategies. Good command of idiomatic expressions and colloquialisms. B2 Has a good range of vocabulary for matters connected to his/her field and most general topics. Can vary formulations to avoid frequent repetitions, but lexical gaps can still cause hesitation and circumlocution. B1 Has a sufficient vocabulary to express him/herself with some circumlocutions on most topics pertinent to his/her everyday life such as family, hobbies and interests, work, travel, and current events. A2+ Has sufficient vocabulary to conduct routine, everyday transactions involving familiar situations and topics. A2 Has a sufficient vocabulary for the expression of basic communicative needs. Has a sufficient vocabulary for coping with simple survival needs. A1 Has a basic vocabulary repertoire of isolated words and phrases related to particular concrete situations. In this scaling approach, teacher decisions are of central importance. They are used as data in the statistical procedure (the Rasch model) which serves as an arbiter in determining descriptor quality (Fulcher et al. 2011: 7). The vertical dimension of the CEFR scales thus mirrors ‘scaled teacher perceptions’ (North 2014b: 23). The scales contain only descriptors the difficulty of which teachers could reliably agree upon, so that they are built completely on practitioners’ beliefs. Although it is important for a scale to be plausible for its users, this mono-dimensional scaling methodology potentially threatens validity, and for several reasons. What is most important here is the fact that no learner language was analysed to examine whether the descriptors can be meaningfully applied to authentic data (Tschirner 2005: 55; Hulstijn 2007; Little 2007: 648; Fulcher 2008). Without the link to empirical learner language, though, the usefulness of the descriptors is at stake: ‘If descriptors are to be meaningful characterizations of ability, then they should be able to be related to actual performance.’ (Alderson 1991: 74). A second drawback is the fact that the scaling procedure was not construct-driven, that is, descriptors were not derived from models of language ability or theories of second language acquisition, and no empirical SLA findings were integrated. However, an attempt at relating the horizontal scale categories (e.g. fluency, grammatical accuracy) to taxonomies of proficiency models was made (North 1997; North 2000: 20; 123–9; North 2014b: 22–5). The potential threat to theoretical scale validity is not discussed further here for reasons of space (see, e.g. North 2014b: 22–5; Hulstijn 2007; Wisniewski 2014).1 Third, it is a benefit of the CEFR scales to be usable in a reliable way by raters. However, this reliability is only meaningful if evidence also suggests that raters actually refer to the scale contents in decision-making. Research on the validity of rating behaviour has long revealed many problems, even with trained raters (e.g. Eckes 2008; Wisniewski 2010, 2014; Alderson and Kremmel 2013; Kuiken and Vedder 2014). Hence, when analysing CEFR-based ratings, validity is not to be confused with reliability, the former being still in need of much more research attention. VALIDATION APPROACH Validity is a complex concept. In language assessment, it is not considered a property of a language test, but ‘(…) an integrated evaluative judgment of the degree to which empirical evidences and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment’ (Messick 1989: 13). As a consequence, validity arguments in the field of language testing normally focus on delivering evidence for the appropriateness of interpretations and consequences based on test scores (Messick 1989; Chapelle 1999; Kane 2001; Bachman and Palmer 2010), whereas rating scale validity aspects are rarely analysed in their own right (exceptions are Harsch 2005; Knoch 2007, 2009, 2011). Regardless of that, the CEFR scales have assumed a peculiar role that could be defined as ‘pseudo constructs’, thus contributing considerably to the overall validity of test interpretations (McNamara et al. 2002; Knoch 2007). This underscores the importance of specific CEFR scale validation studies. Surprisingly, to the knowledge of the author there are no validation studies that analyse learner language through the lense of operationalized CEFR scales. Instead, an increasing number of research projects aims at illustrating the meaning of CEFR levels by looking for linguistic correlates of ratings in learner texts (e.g. Bartning et al. 2010; Harrison and Barker 2015; for a discussion, see Wisniewski forthcoming). This methodology must be clearly distinguished from attempts at scale validation like the one presented here. In the ‘criterial feature’ perspective (e.g. Hawkins and Filipovíc 2012), CEFR scales are not questioned for validity. Rather, CEFR-based ratings are used as pre-classifications of learner texts (most often stemming from language tests) on the basis of which linguistic features discriminating between CEFR levels, or rather, between CEFR-based ratings, are searched. This ‘criterial features’ approach is extremely helpful for the validation of ratings, while it does not directly regard the relationship between CEFR scales and learner language, that is, their empirical validity.2 The present study relies on a newly developed approach for empirical CEFR scale validation some elements of which are explained in more detail here. One important premise was the avoidance of human ratings, which in any scale validation effort are problematic. As mentioned above, even professional ratings are known to be biased (Eckes 2008; Wisniewski 2010, 2014; Alderson and Kremmel 2013; Kuiken and Vedder 2014). Even very reliable ratings need not be valid, that is, raters do not necessarily have to refer to the rating instruments at hand in order to arrive at unanimous decisions. Therefore, the validation approach used here works with operationalized contents from CEFR descriptors to analyse learner language directly through the lense of the scales. To understand the focus of this validation study, it is important to remember that CEFR scales are not meant to be used for language testing in their published form. Any testing context requires modifications to adapt the scales to specific test purposes and users (Alderson 1991; Harsch 2005; North 2014a: 244). However, independently of concrete language tests (or any other context), and before putting modified scale versions to use, early on we would want to be sure that what is already there is actually relatable to learner language, that is, empirically robust. The validation approach used here focuses on this very fundamental element of validity. However, even if empirical robustness evidence could be provided, this would still be no guarantee for the development of a concrete valid test as a whole. Empirical robustness evidence can only support the assumption that it is in principle possible to use the CEFR scales in a valid way. Furthermore, validation studies necessarily have to limit themselves to a specific language elicitation context. Learner language is produced in an immense variety of contexts, so that it is crucial to specify what we can expect CEFR scale contents to be relatable to. Theoretically, chapter five scales like the ones considered here claim to be connected to underlying constructs of ‘communicative language competence’ (North 2000: 20, 28; CoE 2001). Therefore, they should be applicable independently of concrete (test) tasks (‘context-free’, see Hudson 2005: 209–10) and relevant to a large number of contexts (Fulcher 1996: 44; North 2007: 658). Here, CEFR descriptors are applied to two oral proficiency test tasks thoroughly related to the CEFR (see below for details). This is considered a typical, widely encountered use context of CEFR-based learner language classification. It is exactly in this type of context that (modified) CEFR scales are regularly used, often with serious consequences for test takers’ lives. However, this is but one of a great number of possible contexts. Furthermore, it is not the aim to shed light on the relationship between language proficiency as a whole, or of second language development, and the CEFR scales, where much more and different types of data would be needed. Apart from the restricted sample size that does not allow for generalizations, another constraint of the study is the focus on empirical robustness alone. In a more comprehensive validation approach, the theoretical foundations of the scales would have to be questioned as well. Another aspect not addressed here, but strongly intertwined with empirical robustness, is the question whether we can rely on the validity of human ratings as reflecting rating scale contents. These aspects are both discussed in Wisniewski (2014). METHODOLOGY To address the empirical validity of the CEFR vocabulary and fluency scales (A2-B2, Italian (ITA) and German (GER) as L2) in a language test, the empirical robustness concept was broken down into the three following main assumptions: If the CEFR scales claim empirical validity, (i) the concepts mentioned in the level descriptions must be observable. These concepts are not spelled out in the CEFR so that they need to be operationalized carefully. Level descriptions are usually based on more than one such concept (e.g. the number and severeness of lexical errors), and for each concept, a prediction is made regarding its quantity (e.g. ‘good/sufficient/basic range of vocabulary’) and/or its more qualitative contextual framing (e.g. ‘vocabulary sufficient for survival needs’). A second prerequisite for empirical validity is the possibility to clearly (ii) group individual learner texts with the help of these predictions. Third, it would strengthen the validity if (iii) links between scale contents and research-based construct measures were shown. In addition, no greater differences between the target languages (ITA/GER) and the tasks should be found. An oral proficiency test with 98 South Tyrolean 17–18-year-old high-school pupils (L1 GER/ITA, L2 ITA/GER) was carried through in a multimethod large-scale language assessment project (Abel et al. 2012). All participants had comparable L2 learning backgrounds. In the test, the participants did a monologue task for which they had to choose one out of four pictures each showing a person. They were asked to describe what they see and to say most they could about the life of the person they had chosen. They had three minutes to prepare themselves. The instructions gave some hints at possible topics. The interviewer was advised to intervene only in case of long silent stretches to suggest further aspects participants could talk about. In the dialogue task, test takers had to choose one of two topics (reality TV shows and horoscopes). They were asked to describe their personal experience and then to critically discuss the topic with the interviewer. Again, they had three minutes to prepare themselves. The total duration was of approximately 15 minutes. All productions were rated independently by two trained raters. Quality standards in the test development, administration, and evaluation process were respected (Bachman and Palmer 1996, 2010; CoE 2004, 2009/2003; Fulcher 2003; see report on full project, Abel et al. 2012). Out of this database, 19 productions each containing two tasks were chosen for this study. Selected candidates produced a sufficient amount of clearly intelligible speech. Additionally, they received highly reliable ratings for vocabulary and fluency. Their rating profiles were flat, that is, they had similar CEFR ratings for the rating criteria applied (see Wisniewski 2014 for details). For the analyses, audio recordings were transcribed in the multi-layer standoff editor ELAN in CHAT. Then, the A2-B2 level descriptions were operationalized (see appendix, Table A1) in order to establish a direct link between CEFR scales and learner language. Unavoidably, this required interpretation. Subjective (e.g. ‘regular interaction with native speakers quite possible’, fluency, B2) and self-referential aspects (e.g. ‘can interact with a degree of fluency…’, fluency, B2) were excluded. In case of translation ambiguity, the English CEFR version was used. At times, several interpretations were plausible; then, all were operationalized and cross-checked. In some cases, ‘inverted’ concepts were used in order to make them more easily comparable across levels. If, for example, the scale claimed that it was typical for a learner not to show breakdowns in communication, the scale variable would count those breakdowns (normalized, i.e. per utterance and word token). Operationalized descriptor contents were termed ‘scale variables’. Many of them are uncommon, subjective, and/or hard to reliably observe, but they reflect the scale contents as directly as possible. One exemplary level operationalization is described in Wisniewski (2013); the complete rationale can be found in Wisniewski (2014). In addition to the scale variables, a considerable number of research-based indicators were used (e.g. lexical density and sophistication measures, mean length of runs, phonation-time-ratio and many others; see appendix, Tables A2 and A3). The annotation was carried through by two coders independently (inter-rater reliability: Pearson’s contingency coefficient C =0.899, Cohen’s Kappa κ =0.773). For all annotations the coders had not agreed upon, a consensus was formed. The corpus was manually segmented into syllables, and WordSmith and the Stuttgart TreeTagger (Schmid 1994) were used for tokenization and lemmatization. The fully annotated corpus is available on a DVD from the author. A variety of statistical procedures was used. Descriptive statistics was employed to evaluate (i) the observability of scale variables. Relative frequencies were calculated with regard to appropriate units of reference, for example, word tokens, AS-units (Foster, Tonkyn, and Wigglesworth 2000), or syllables. To assess (ii) the consistency of level descriptions, correlations among scale variables were calculated (Pearson’s r). Cluster analyses helped group candidates according to operationalized scale contents (Ward and k-means methods). Cluster analysis is a statistical procedure that detects structures in data without pre-classifications (such as human ratings) with regard to pre-defined variables (in this case, scale variables per level and scale). It groups cases (here: learner texts) most similar to each other in terms of scale variables of a level description. In other words, these groups (termed ‘CEFR clusters’) can be understood as an application of the operationalizable parts of the scale contents. The statistical quality of cluster solutions is determined by the (minimized) variance inside the clusters as compared to the variance in the whole population. However, the content plausibility of the cluster solutions plays an equally important role for the present study as will become clear below. Discriminant analyses delivered Wilks’ Lambda (Λ), an inverse measure of variance that estimates the explanatory power of the scale variables in separating clusters from one another. Wilks’ Lambda helps understand how easy it would be to distinguish candidates with the help of the scale variables alone. (iii) To link scale variables to underlying constructs and to search for evidence of the construct-relevance of the clusters, in a first step correlations between the scale variables and research-based measures were calculated. Furthermore, T-tests and discriminant analyses were run to analyse whether the clusters (see ii, above) were relatable to more established measures of the relevant constructs. RESULTS Observability of scale variables Scale variables were generally easier to observe in the dialogue task (T2), maybe due to its construction or simply to the fact that it elicited more language (Table 2). Table 2: Differences in task length Length T1 SD T1 T2 SD T2 Mean length of production, tokens 160.39 62.39 330.26 112.57 Mean length of productions, minutes 4.1 .78 7.28 1.02 Length T1 SD T1 T2 SD T2 Mean length of production, tokens 160.39 62.39 330.26 112.57 Mean length of productions, minutes 4.1 .78 7.28 1.02 Notes: SD = standard deviation, T = task. Table 2: Differences in task length Length T1 SD T1 T2 SD T2 Mean length of production, tokens 160.39 62.39 330.26 112.57 Mean length of productions, minutes 4.1 .78 7.28 1.02 Length T1 SD T1 T2 SD T2 Mean length of production, tokens 160.39 62.39 330.26 112.57 Mean length of productions, minutes 4.1 .78 7.28 1.02 Notes: SD = standard deviation, T = task. T1 elicited rather short answers, which is not uncommon in language testing. The task was constructed following the example of the Swiss Project. Fluency Some aspects of the fluency scale were hardly observable, above all false starts and incomprehensible utterances (Table 3; appendix, Table A1). Table 3: Observability of fluency scale variables Scale variable CEFR level T1 mean T1 SD T2 mean T2 SD Pauses A2-B2 54.86 13.07 88.24 24.36 False starts A2 0.42 0.6 1.47 1.6 ‘B1 pauses’ B1 7.26 6.76 16.68 8.64 Incomprehensible utterances B1 0.43 0.67 3.53 2.78 ‘B2 pauses’ B2 5.37 5.87 13.16 7.65 Long pauses B2 4 2.93 3.05 3.01 Scale variable CEFR level T1 mean T1 SD T2 mean T2 SD Pauses A2-B2 54.86 13.07 88.24 24.36 False starts A2 0.42 0.6 1.47 1.6 ‘B1 pauses’ B1 7.26 6.76 16.68 8.64 Incomprehensible utterances B1 0.43 0.67 3.53 2.78 ‘B2 pauses’ B2 5.37 5.87 13.16 7.65 Long pauses B2 4 2.93 3.05 3.01 Notes: High SD values to be expected for different proficiency levels; all values per production. Table 3: Observability of fluency scale variables Scale variable CEFR level T1 mean T1 SD T2 mean T2 SD Pauses A2-B2 54.86 13.07 88.24 24.36 False starts A2 0.42 0.6 1.47 1.6 ‘B1 pauses’ B1 7.26 6.76 16.68 8.64 Incomprehensible utterances B1 0.43 0.67 3.53 2.78 ‘B2 pauses’ B2 5.37 5.87 13.16 7.65 Long pauses B2 4 2.93 3.05 3.01 Scale variable CEFR level T1 mean T1 SD T2 mean T2 SD Pauses A2-B2 54.86 13.07 88.24 24.36 False starts A2 0.42 0.6 1.47 1.6 ‘B1 pauses’ B1 7.26 6.76 16.68 8.64 Incomprehensible utterances B1 0.43 0.67 3.53 2.78 ‘B2 pauses’ B2 5.37 5.87 13.16 7.65 Long pauses B2 4 2.93 3.05 3.01 Notes: High SD values to be expected for different proficiency levels; all values per production. Table 4: Observability of vocabulary range scale variables Scale variable CEFR level T1 mean T1 SD T2 mean T2 SD Communication problems regarding communicative needs/survival needs A2 0.11 0.31 0 0 Communication problems in basic/everyday life communication B1 0.21 0.54 0.58 0.84 Circumlocutions B1, B2 0.16 0.37 1 1.41 Communication problems regarding own field/general topics B2 0.37 0.76 1.21 1.27 Pauses for lexical planning B2 4.65 4.95 12.68 7.99 Repetitions B2 5.47 3.47 13.26 9.5 Scale variable CEFR level T1 mean T1 SD T2 mean T2 SD Communication problems regarding communicative needs/survival needs A2 0.11 0.31 0 0 Communication problems in basic/everyday life communication B1 0.21 0.54 0.58 0.84 Circumlocutions B1, B2 0.16 0.37 1 1.41 Communication problems regarding own field/general topics B2 0.37 0.76 1.21 1.27 Pauses for lexical planning B2 4.65 4.95 12.68 7.99 Repetitions B2 5.47 3.47 13.26 9.5 Notes: High SD values to be expected for different proficiency levels; all values per production. Table 4: Observability of vocabulary range scale variables Scale variable CEFR level T1 mean T1 SD T2 mean T2 SD Communication problems regarding communicative needs/survival needs A2 0.11 0.31 0 0 Communication problems in basic/everyday life communication B1 0.21 0.54 0.58 0.84 Circumlocutions B1, B2 0.16 0.37 1 1.41 Communication problems regarding own field/general topics B2 0.37 0.76 1.21 1.27 Pauses for lexical planning B2 4.65 4.95 12.68 7.99 Repetitions B2 5.47 3.47 13.26 9.5 Scale variable CEFR level T1 mean T1 SD T2 mean T2 SD Communication problems regarding communicative needs/survival needs A2 0.11 0.31 0 0 Communication problems in basic/everyday life communication B1 0.21 0.54 0.58 0.84 Circumlocutions B1, B2 0.16 0.37 1 1.41 Communication problems regarding own field/general topics B2 0.37 0.76 1.21 1.27 Pauses for lexical planning B2 4.65 4.95 12.68 7.99 Repetitions B2 5.47 3.47 13.26 9.5 Notes: High SD values to be expected for different proficiency levels; all values per production. Other phenomena were not as salient as suggested by the scale: the lexical and grammatical planning and repair pauses defined typical for level B1 and B2, respectively (Table A1, appendix), represented less then a fifth of all learners’ pauses (B1: mean 16.29%, SD 10.78; B2: mean 13.1%, SD 9.99). Furthermore, assigning pausing reasons was problematic and speculative: half of all pauses (55.45%, SD 10.68) remained unexplained, and for the rest, inter-coder reliability was relatively low (C = 0.936 but κ =0.51). The B1 level assumption that the number of pauses is dependent on the length of the run proved a general tendency in the sample. Almost all learners paused considerably more in long utterances (mean of 1.85 times more pauses, SD 0.32). Lower utterance fluency in longer runs, then, might be due to general cognitive speech production processes. Vocabulary range Most problems in the vocabulary range scale (Table 4) were caused by scale variables linked to specific lexical fields and communicative functions (A2: basic communicative/simple survival needs; B1: basic vocabulary/everyday life; B2: own fields, general topics), particularly on levels A2 and B1. For methodological reasons, these descriptors were inverted in the operationalization. Hence, the resulting scale variables count failures to communicate in the lexical field under consideration (appendix, Table A1).3 All learners nearly constantly communicated successfully in all these lexical fields. Only two speakers had each one problem to make hilmself understood with regard to the (only!) A2 scale variable, while for B1, three learners in T1 (each once) and 5 learners in T2 (N = 8) did not get their messages across. These scale variables were not helpful to distinguish the learners from one another. Vocabulary control Generally, the scale variables were well observable, particularly in T2 (Table 5, appendix, Table A1). Table 5: Observability of vocabulary control scale variables Scale variable CEFR level T1 mean T1 SD T2 mean T2 SD Errors in every day needs vocabulary A2 2.16 1.83 1.26 1.19 Errors in elementary vocabulary (>4.000) B1 3 2.4 7.74 3.31 Errors in ‘rare’ vocabulary (<4.000) B1 0.58 1.12 0.84 1.12 Major errors B1 1.63 1.39 2.79 1.93 Lexical errors B2 3.6 2.2 8.6 3.2 Synforms B2 0.63 1.12 0.74 1.1 Incorrect word choices B2 2.84 2.39 6.37 2.56 Lexical errors that hinder communication B2 0.37 0.76 1.42 1.5 Scale variable CEFR level T1 mean T1 SD T2 mean T2 SD Errors in every day needs vocabulary A2 2.16 1.83 1.26 1.19 Errors in elementary vocabulary (>4.000) B1 3 2.4 7.74 3.31 Errors in ‘rare’ vocabulary (<4.000) B1 0.58 1.12 0.84 1.12 Major errors B1 1.63 1.39 2.79 1.93 Lexical errors B2 3.6 2.2 8.6 3.2 Synforms B2 0.63 1.12 0.74 1.1 Incorrect word choices B2 2.84 2.39 6.37 2.56 Lexical errors that hinder communication B2 0.37 0.76 1.42 1.5 Notes: High SD values to be expected for different proficiency levels; all values per production. Table 5: Observability of vocabulary control scale variables Scale variable CEFR level T1 mean T1 SD T2 mean T2 SD Errors in every day needs vocabulary A2 2.16 1.83 1.26 1.19 Errors in elementary vocabulary (>4.000) B1 3 2.4 7.74 3.31 Errors in ‘rare’ vocabulary (<4.000) B1 0.58 1.12 0.84 1.12 Major errors B1 1.63 1.39 2.79 1.93 Lexical errors B2 3.6 2.2 8.6 3.2 Synforms B2 0.63 1.12 0.74 1.1 Incorrect word choices B2 2.84 2.39 6.37 2.56 Lexical errors that hinder communication B2 0.37 0.76 1.42 1.5 Scale variable CEFR level T1 mean T1 SD T2 mean T2 SD Errors in every day needs vocabulary A2 2.16 1.83 1.26 1.19 Errors in elementary vocabulary (>4.000) B1 3 2.4 7.74 3.31 Errors in ‘rare’ vocabulary (<4.000) B1 0.58 1.12 0.84 1.12 Major errors B1 1.63 1.39 2.79 1.93 Lexical errors B2 3.6 2.2 8.6 3.2 Synforms B2 0.63 1.12 0.74 1.1 Incorrect word choices B2 2.84 2.39 6.37 2.56 Lexical errors that hinder communication B2 0.37 0.76 1.42 1.5 Notes: High SD values to be expected for different proficiency levels; all values per production. Unexpectedly, almost all learners made lexical errors in communicating every day needs (A2). The B1 level description contains the assumption that more ‘major errors’4 are made when speakers express more complex thoughts. Although intuitive, there is no data to support this hypothesis. Interestingly, T1 yielded a high percentage of errors to be considered ‘major’ (42.26%, SD 30.82), whereas in the more complex T2 in which a controversial issue was to be discussed (‘horoscopes’ or ‘TV talent shows’), the percentage was lower (33.32%, SD 19.8). This is true although the mean number of lexical errors per token is almost leveled in both tasks (T1: 0.3 (SD 0.3); T2: 0.3 (SD 0.2)). A possible explanation is that strategic competence might enable learners to control lexical correctness even when confronted with a more complex topic (‘islands of reliability’, Rohde 1985). The assumption that learners make more lexical errors outside the ‘comfort zone’ of elementary vocabulary (B1) is problematic because lexical items rarer than the first 4,000 words (‘rare words’) of ITA/GER (Jones and Tschirner 2006; De Mauro and Mancini 1993; De Mauro 2000, Tschirner 2010) were hardly observable. Even if the participants made more errors in rare words than was to be expected from their total occurrence (i.e. 27 out of 266 lexical errors, or 10.15%), no learner used less than 90% of basic vocabulary, and some did not use a single rare word (mean T1: 3.9% (SD 2.27); T2: 2.53 (SD 1.84%)). This scale variable thus could not help distinguish learners from one another in free production tasks. It requires a different task type to decide if the almost exclusive recurrence to basic vocabulary is due to a lack of declarative lexical knowledge or to lexical avoidance strategies. CLUSTERING LEARNER LANGUAGE WITH CEFR SCALE VARIABLES Fluency Few scale variable correlations were significant inside and across task/language(s). The ones more coherently related to each other mirror temporal aspects of fluency (B2: number of pauses/syllable and mean length of pauses, 0.775 in T1, 0.814 in T2). However, these scale variables were not related to other, more unusual scale variables like ‘B1 pauses’ (Table 6). Table 6: Cluster solutions for the fluency scale Quality Level Task Language Variance Variable Wilks’ Lambda Good A2 (without false starts) T1 ITA Length of pauses Λ = .474, p = .018 GER Complexity Λ = .174, p = .001 B1 T2 ITA ↑: Pauses dep. on utter. length Λ = .075, p = .009 ↓: Incompreh. utter. Λ = .309, p = .040 GER ↑: Pauses dep. on utter. length Λ = .092, p = .003 Contradictory A2 T2 ITA Number of pauses Λ = .189, p = .000 GER Number of pauses Λ = .248, p = .002 B2 T1 ITA – – GER – – Quality Level Task Language Variance Variable Wilks’ Lambda Good A2 (without false starts) T1 ITA Length of pauses Λ = .474, p = .018 GER Complexity Λ = .174, p = .001 B1 T2 ITA ↑: Pauses dep. on utter. length Λ = .075, p = .009 ↓: Incompreh. utter. Λ = .309, p = .040 GER ↑: Pauses dep. on utter. length Λ = .092, p = .003 Contradictory A2 T2 ITA Number of pauses Λ = .189, p = .000 GER Number of pauses Λ = .248, p = .002 B2 T1 ITA – – GER – – Notes: Wilks’ Lambda is a measure of the proportion of variance NOT explained by the variable. In case of the B1 level, the target cluster can be separated from ‘higher’ (↑) and ‘lower’ (↓) clusters. All measures normalized (per minute, syllable, token). Table 6: Cluster solutions for the fluency scale Quality Level Task Language Variance Variable Wilks’ Lambda Good A2 (without false starts) T1 ITA Length of pauses Λ = .474, p = .018 GER Complexity Λ = .174, p = .001 B1 T2 ITA ↑: Pauses dep. on utter. length Λ = .075, p = .009 ↓: Incompreh. utter. Λ = .309, p = .040 GER ↑: Pauses dep. on utter. length Λ = .092, p = .003 Contradictory A2 T2 ITA Number of pauses Λ = .189, p = .000 GER Number of pauses Λ = .248, p = .002 B2 T1 ITA – – GER – – Quality Level Task Language Variance Variable Wilks’ Lambda Good A2 (without false starts) T1 ITA Length of pauses Λ = .474, p = .018 GER Complexity Λ = .174, p = .001 B1 T2 ITA ↑: Pauses dep. on utter. length Λ = .075, p = .009 ↓: Incompreh. utter. Λ = .309, p = .040 GER ↑: Pauses dep. on utter. length Λ = .092, p = .003 Contradictory A2 T2 ITA Number of pauses Λ = .189, p = .000 GER Number of pauses Λ = .248, p = .002 B2 T1 ITA – – GER – – Notes: Wilks’ Lambda is a measure of the proportion of variance NOT explained by the variable. In case of the B1 level, the target cluster can be separated from ‘higher’ (↑) and ‘lower’ (↓) clusters. All measures normalized (per minute, syllable, token). Clusters that were completely statistically homogeneous (F < 1) and plausible, i.e. corresponding to CEFR predictions (‘good’ solutions in Table 6), could only be found for A2/T1 (if ‘false starts’ were excluded) and for B1/T2. Statistically homogeneous clusters with scale variables assuming values not foreseen in the scale (‘contradictory’ in Table 6) were found in two scenarios (A2/T2 and B2/T1). Discriminant analyses showed that single scale variables helped explain variance between ‘CEFR clusters’ and other learners, although in many cases they had only very weak explanatory power. For the other scenarios, no clustering was possible (Table 7). Learners regularly behaved differently from what level descriptions suggest. Scale variables resulted in very heterogeneous clusters. Table 7 also shows that the feasibility of the clusterings was generally more dependent on the task than on the target language. Table 7: Overview possibility of cluster solutions Notes: Good cluster solutions with F < 1 & correspondence to scale predictions (), contradictory cluster solutions with F < 1 but no correspondence to scale predictions (), no clustering possible (). Table 7: Overview possibility of cluster solutions Notes: Good cluster solutions with F < 1 & correspondence to scale predictions (), contradictory cluster solutions with F < 1 but no correspondence to scale predictions (), no clustering possible (). Many speakers’ performances fitted more than one level description (N = 9), others none (N = 3 in T1, but N = 8 in T2, Table 8). Table 8: Learners in clusters Fluency Vocabulary range Vocabulary control T1 T2 T1 T2 T1 T2 Language Learner Cluster Cluster Cluster Cluster Cluster Cluster ITA 1 B1 – B2 – A2 – 2 – A2/B1 – B1 A2 A2/B1 3 B1/B2 – B2 – A2/B1/B2 A2/B1/B2 4 A2 B1/B2 – – – A2 5 A2/B1 A2 – B2 B2 A2 6 A2 A2 – B2 – A2 7 B2 – B2 – A2/B1/B2 A2/B1/B2 8 – – B2 – A2 A2 9 B2 A2 – B1/B2 B2 A2 10 A2/B1/B2 B1 – – B2 A2/B1/B2 GER 11 A2/B1 A2 – – – – 12 A2 A2 B2 – – A2 13 B2 – B2 B2 B2 A2/B2 14 – – – – A2/B1/B2 A2/B1/B2 15 B1/B2 – – B2 B1/B2 A2/B1 16 B2 B1/B2 – – A2 A2/B1/B2 17 B2 – B2 – B2 A2 18 A2/B1 A2 B2 – A2/B1/B2 B1/B2 19 A2 A2 – – – – Fluency Vocabulary range Vocabulary control T1 T2 T1 T2 T1 T2 Language Learner Cluster Cluster Cluster Cluster Cluster Cluster ITA 1 B1 – B2 – A2 – 2 – A2/B1 – B1 A2 A2/B1 3 B1/B2 – B2 – A2/B1/B2 A2/B1/B2 4 A2 B1/B2 – – – A2 5 A2/B1 A2 – B2 B2 A2 6 A2 A2 – B2 – A2 7 B2 – B2 – A2/B1/B2 A2/B1/B2 8 – – B2 – A2 A2 9 B2 A2 – B1/B2 B2 A2 10 A2/B1/B2 B1 – – B2 A2/B1/B2 GER 11 A2/B1 A2 – – – – 12 A2 A2 B2 – – A2 13 B2 – B2 B2 B2 A2/B2 14 – – – – A2/B1/B2 A2/B1/B2 15 B1/B2 – – B2 B1/B2 A2/B1 16 B2 B1/B2 – – A2 A2/B1/B2 17 B2 – B2 – B2 A2 18 A2/B1 A2 B2 – A2/B1/B2 B1/B2 19 A2 A2 – – – – Table 8: Learners in clusters Fluency Vocabulary range Vocabulary control T1 T2 T1 T2 T1 T2 Language Learner Cluster Cluster Cluster Cluster Cluster Cluster ITA 1 B1 – B2 – A2 – 2 – A2/B1 – B1 A2 A2/B1 3 B1/B2 – B2 – A2/B1/B2 A2/B1/B2 4 A2 B1/B2 – – – A2 5 A2/B1 A2 – B2 B2 A2 6 A2 A2 – B2 – A2 7 B2 – B2 – A2/B1/B2 A2/B1/B2 8 – – B2 – A2 A2 9 B2 A2 – B1/B2 B2 A2 10 A2/B1/B2 B1 – – B2 A2/B1/B2 GER 11 A2/B1 A2 – – – – 12 A2 A2 B2 – – A2 13 B2 – B2 B2 B2 A2/B2 14 – – – – A2/B1/B2 A2/B1/B2 15 B1/B2 – – B2 B1/B2 A2/B1 16 B2 B1/B2 – – A2 A2/B1/B2 17 B2 – B2 – B2 A2 18 A2/B1 A2 B2 – A2/B1/B2 B1/B2 19 A2 A2 – – – – Fluency Vocabulary range Vocabulary control T1 T2 T1 T2 T1 T2 Language Learner Cluster Cluster Cluster Cluster Cluster Cluster ITA 1 B1 – B2 – A2 – 2 – A2/B1 – B1 A2 A2/B1 3 B1/B2 – B2 – A2/B1/B2 A2/B1/B2 4 A2 B1/B2 – – – A2 5 A2/B1 A2 – B2 B2 A2 6 A2 A2 – B2 – A2 7 B2 – B2 – A2/B1/B2 A2/B1/B2 8 – – B2 – A2 A2 9 B2 A2 – B1/B2 B2 A2 10 A2/B1/B2 B1 – – B2 A2/B1/B2 GER 11 A2/B1 A2 – – – – 12 A2 A2 B2 – – A2 13 B2 – B2 B2 B2 A2/B2 14 – – – – A2/B1/B2 A2/B1/B2 15 B1/B2 – – B2 B1/B2 A2/B1 16 B2 B1/B2 – – A2 A2/B1/B2 17 B2 – B2 – B2 A2 18 A2/B1 A2 B2 – A2/B1/B2 B1/B2 19 A2 A2 – – – – Vocabulary range For the vocabulary range scale variables, there is not one consistent correlation across languages and tasks. There are no significant correlations among the B1 variables, and only two on B2 (GER/T1: 0.724, circumlocutions and repetitions per syllable; GER/T2: 0.814, incomprehensible utterances/token and lexical planning pauses/syllable). Only on level B2 and B1 it was possible to find ‘CEFR clusters’ (Tables 7 and 9). Most learners fitted no level description (T1: N = 11, T2: N = 13, Table 8). Table 9: Cluster solutions for the vocabulary range scale Quality Level Task Language Variance Variable Wilks’ Lambda Good B1 T2 ITA ↓ Incompreh. utter., Λ = .208, p = .017 ↑ Circumlocutions Λ = .002, p = .000 B2 T1 ITA – – GER – – B2 T2 ITA Lex. planning pauses Λ = .417, p = .010 Repetitions Λ = .159, p = .002 GER – – Quality Level Task Language Variance Variable Wilks’ Lambda Good B1 T2 ITA ↓ Incompreh. utter., Λ = .208, p = .017 ↑ Circumlocutions Λ = .002, p = .000 B2 T1 ITA – – GER – – B2 T2 ITA Lex. planning pauses Λ = .417, p = .010 Repetitions Λ = .159, p = .002 GER – – Table 9: Cluster solutions for the vocabulary range scale Quality Level Task Language Variance Variable Wilks’ Lambda Good B1 T2 ITA ↓ Incompreh. utter., Λ = .208, p = .017 ↑ Circumlocutions Λ = .002, p = .000 B2 T1 ITA – – GER – – B2 T2 ITA Lex. planning pauses Λ = .417, p = .010 Repetitions Λ = .159, p = .002 GER – – Quality Level Task Language Variance Variable Wilks’ Lambda Good B1 T2 ITA ↓ Incompreh. utter., Λ = .208, p = .017 ↑ Circumlocutions Λ = .002, p = .000 B2 T1 ITA – – GER – – B2 T2 ITA Lex. planning pauses Λ = .417, p = .010 Repetitions Λ = .159, p = .002 GER – – For the B1 level descriptions, in the other scenarios (GER & ITA/T1 and GER/T2) the clusters assumed values that systematically contradicted the CEFR scale. As the A2 level description was true for all individuals in the sample, it could not be used to group speakers. Vocabulary control The vocabulary control scale variables correlate modestly. On B1, the ‘number of errors in elementary vocabulary/token’ correlates quite consistently with the ‘number of major errors/token’ (GER/T1: 0.915, T2: 0.813, ITA/T2: 0.899). On level B2, the ‘number of incorrect word choices/token’ correlates with the ‘number of lexical errors/token’ (GER/T1: 0.954, GER/T2: 0.968; ITA/T1: 0.864, ITA/T2: 0.914). Clusters are more ambiguous than for the other scales (see Table 8): Many productions correspond well to two (N = 5) or even three (N = 9) level descriptions. More than half of the productions in the sample thus could not unambiguously be matched to one level description. On B1, clusters could be found, although again the scale variables showed strongly variable patterns, and none of them significantly reduced variance between clusters. On B2, although some individuals’ behavior corresponded to the scale, single scale variables were rarely observable (Table 10) so that the plausibility of the solutions is questionable; also, statistical homogeneity was partially threatened. Table 10: Cluster solutions for the vocabulary control scale Quality Level Task Language Variance Variable Wilks’ Lambda Good A2 T1 ITA Errors everyday needs Λ = .258, p = .001 GER Errors everyday needs Λ = .173, p = .001 A2 T2 ITA – – GER Errors everyday needs Λ = .127, p = .000 B1 T1 – – – – B1 T2 – – – – Contradictory B2 T1 ITA Lexical errors Λ = .459, p = .024 GER Lexical errors incompr. utter. Λ = .195, p = .001 Λ = .096, p = .001 B2 T2 ITA Incompr. utter. Λ = .569, p = .039 GER – – Quality Level Task Language Variance Variable Wilks’ Lambda Good A2 T1 ITA Errors everyday needs Λ = .258, p = .001 GER Errors everyday needs Λ = .173, p = .001 A2 T2 ITA – – GER Errors everyday needs Λ = .127, p = .000 B1 T1 – – – – B1 T2 – – – – Contradictory B2 T1 ITA Lexical errors Λ = .459, p = .024 GER Lexical errors incompr. utter. Λ = .195, p = .001 Λ = .096, p = .001 B2 T2 ITA Incompr. utter. Λ = .569, p = .039 GER – – Table 10: Cluster solutions for the vocabulary control scale Quality Level Task Language Variance Variable Wilks’ Lambda Good A2 T1 ITA Errors everyday needs Λ = .258, p = .001 GER Errors everyday needs Λ = .173, p = .001 A2 T2 ITA – – GER Errors everyday needs Λ = .127, p = .000 B1 T1 – – – – B1 T2 – – – – Contradictory B2 T1 ITA Lexical errors Λ = .459, p = .024 GER Lexical errors incompr. utter. Λ = .195, p = .001 Λ = .096, p = .001 B2 T2 ITA Incompr. utter. Λ = .569, p = .039 GER – – Quality Level Task Language Variance Variable Wilks’ Lambda Good A2 T1 ITA Errors everyday needs Λ = .258, p = .001 GER Errors everyday needs Λ = .173, p = .001 A2 T2 ITA – – GER Errors everyday needs Λ = .127, p = .000 B1 T1 – – – – B1 T2 – – – – Contradictory B2 T1 ITA Lexical errors Λ = .459, p = .024 GER Lexical errors incompr. utter. Λ = .195, p = .001 Λ = .096, p = .001 B2 T2 ITA Incompr. utter. Λ = .569, p = .039 GER – – RELATING SCALE VARIABLES TO RESEARCH-BASED MEASURES Fluency In a first step, correlations between scale variables and research-based measures were analysed. The analysis showed mixed results, with most correlations found for the A2 level and more consistent correlations in T1 (Table A2 for list of measures). On level A2 and B2, the more functional scale variables have consistent correlations to a temporal ‘low order fluency’ (Lennon 2000).5 In addition, T-tests were used to find fluency-related measures distinguishing CEFR clusters (where feasible, except from B2 with N = 1 in the cluster) from other speakers. Table 11 shows that three standard measures of fluency (the phonation-time ratio, the mean length of runs, and the number of long pauses) significantly differentiate between speakers clustered as A2 and other speakers (p < 0.05; effect sizes are medium to high). However, for the B1 fluency cluster no homogeneous set of fluency measures supporting the cluster could be found. Table 11: Fluency-related measures and clusters Level Task Cluster quality Language Measure p-value of T-Test Effect size (η2) A2 T1 Good ITA Phonation-time ratio .04 .428 T1 Good GER Phonation-time ratio .042 .467 T2 Restricted ITA Phonation-time ratio .011 .710 T2 Restricted GER Phonation-time ratio .001 .003 T2 Restricted ITA Mean length of runs .002 .735 T2 Restricted GER Mean length of runs .003 .729 T1 Good GER Long pauses/minute .045 .459 T2 Restricted ITA Long pauses/minute .047 .408 T2 Restricted GER Long pauses/minute .045 .459 B1 T2 Restricted ITA ↓ Speech rate .018 .863 T2 Restricted ITA ↓ Phonation-time ratio .039 .792 T2 Restricted ITA ↓ Repetitions/token .017 .698 T2 Restricted ITA ↓ Constituent-internal pauses/min. .006 .612 T2 Restricted GER↑ Repetitions/token .003 .816 T2 Restricted GER↑ Hesitation phenomena/token .011 .549 T2 Restricted GER↑ Lengthenings/token .040 .292 Level Task Cluster quality Language Measure p-value of T-Test Effect size (η2) A2 T1 Good ITA Phonation-time ratio .04 .428 T1 Good GER Phonation-time ratio .042 .467 T2 Restricted ITA Phonation-time ratio .011 .710 T2 Restricted GER Phonation-time ratio .001 .003 T2 Restricted ITA Mean length of runs .002 .735 T2 Restricted GER Mean length of runs .003 .729 T1 Good GER Long pauses/minute .045 .459 T2 Restricted ITA Long pauses/minute .047 .408 T2 Restricted GER Long pauses/minute .045 .459 B1 T2 Restricted ITA ↓ Speech rate .018 .863 T2 Restricted ITA ↓ Phonation-time ratio .039 .792 T2 Restricted ITA ↓ Repetitions/token .017 .698 T2 Restricted ITA ↓ Constituent-internal pauses/min. .006 .612 T2 Restricted GER↑ Repetitions/token .003 .816 T2 Restricted GER↑ Hesitation phenomena/token .011 .549 T2 Restricted GER↑ Lengthenings/token .040 .292 Notes: T-test results; significance at 5% level. Table 11: Fluency-related measures and clusters Level Task Cluster quality Language Measure p-value of T-Test Effect size (η2) A2 T1 Good ITA Phonation-time ratio .04 .428 T1 Good GER Phonation-time ratio .042 .467 T2 Restricted ITA Phonation-time ratio .011 .710 T2 Restricted GER Phonation-time ratio .001 .003 T2 Restricted ITA Mean length of runs .002 .735 T2 Restricted GER Mean length of runs .003 .729 T1 Good GER Long pauses/minute .045 .459 T2 Restricted ITA Long pauses/minute .047 .408 T2 Restricted GER Long pauses/minute .045 .459 B1 T2 Restricted ITA ↓ Speech rate .018 .863 T2 Restricted ITA ↓ Phonation-time ratio .039 .792 T2 Restricted ITA ↓ Repetitions/token .017 .698 T2 Restricted ITA ↓ Constituent-internal pauses/min. .006 .612 T2 Restricted GER↑ Repetitions/token .003 .816 T2 Restricted GER↑ Hesitation phenomena/token .011 .549 T2 Restricted GER↑ Lengthenings/token .040 .292 Level Task Cluster quality Language Measure p-value of T-Test Effect size (η2) A2 T1 Good ITA Phonation-time ratio .04 .428 T1 Good GER Phonation-time ratio .042 .467 T2 Restricted ITA Phonation-time ratio .011 .710 T2 Restricted GER Phonation-time ratio .001 .003 T2 Restricted ITA Mean length of runs .002 .735 T2 Restricted GER Mean length of runs .003 .729 T1 Good GER Long pauses/minute .045 .459 T2 Restricted ITA Long pauses/minute .047 .408 T2 Restricted GER Long pauses/minute .045 .459 B1 T2 Restricted ITA ↓ Speech rate .018 .863 T2 Restricted ITA ↓ Phonation-time ratio .039 .792 T2 Restricted ITA ↓ Repetitions/token .017 .698 T2 Restricted ITA ↓ Constituent-internal pauses/min. .006 .612 T2 Restricted GER↑ Repetitions/token .003 .816 T2 Restricted GER↑ Hesitation phenomena/token .011 .549 T2 Restricted GER↑ Lengthenings/token .040 .292 Notes: T-test results; significance at 5% level. Vocabulary range While the scale variables related to concrete lexical fields were difficult to observe, none of the remaining scale variables was correlated to the construct of vocabulary breadth, nor in most cases to any other aspect of lexical competence (Table A3 for measures; Read 2000, 2007; Read and Chapelle 2001; Nation 2001, Wisniewski 2012). The variables regarding lexical planning pauses, circumlocutions, and repetitions (B2) seem to be related to hesitation phenomena, but the correlation analyses results are heterogeneous (languages/tasks); also, to an even lesser degree, there are sporadic links to communication strategies.6 Only single aspects in single scenarios showed significant differences between the clusters. For example, on B1, only for ITA/T2 a cluster solution was possible. These B1 speakers were distinguishable from speakers clustered higher in that they used more literal translations from their L1. B2-clustered speakers used less very common words than weaker speakers, their Guiraud’s Index was higher, and they used, surprisingly, more reformulations (GER/T2: mean 0.014, SD 0.001; weaker clusters: mean 0.006, SD 0.004). However, effect sizes are rather low (Table 12). Table 12: Vocabulary range measures distinguishing between CEFR clusters Level Task Language Measure p-value Effect size (η2) B1 T2 ITA↑ Transfer/token .049 .001 B2 T1 ITA % Most frequent 1,000 words .048 .320 B2 T1 GET Guiraud .046 .143 B2 T2 GER Reformulations/token .005 .521 Level Task Language Measure p-value Effect size (η2) B1 T2 ITA↑ Transfer/token .049 .001 B2 T1 ITA % Most frequent 1,000 words .048 .320 B2 T1 GET Guiraud .046 .143 B2 T2 GER Reformulations/token .005 .521 Notes: T-Test results; significance at 5% level. Table 12: Vocabulary range measures distinguishing between CEFR clusters Level Task Language Measure p-value Effect size (η2) B1 T2 ITA↑ Transfer/token .049 .001 B2 T1 ITA % Most frequent 1,000 words .048 .320 B2 T1 GET Guiraud .046 .143 B2 T2 GER Reformulations/token .005 .521 Level Task Language Measure p-value Effect size (η2) B1 T2 ITA↑ Transfer/token .049 .001 B2 T1 ITA % Most frequent 1,000 words .048 .320 B2 T1 GET Guiraud .046 .143 B2 T2 GER Reformulations/token .005 .521 Notes: T-Test results; significance at 5% level. Vocabulary control Some vocabulary control scale variables were correlated to construct-based measures of lexical accuracy, that is, to the number of lexical errors/token, to the Lexical Quality Indicator and, in the monologue task, to the percentage of error-free AS units (Table A3, appendix), although it was not easy to find measures correlated to all language-task scenarios (Table 13). Table 13: Correlations of measures of lexical correctness with CEFR scale variables (vocabulary control) Level Task Scale variable Measure Correlation (r) A2 T1 ‘Errors everyday needs vocabulary’ Lexical errors/token .627 T2 .460 B1 T1 ‘Errors in elementary vocabulary’ .934 T2 .982 T1 ‘Major errors’ .914 T2 .743 A2 T1 ‘Errors everyday needs vocabulary’ Lexical Quality Indicator −.543 T2 −.494 B1 T1 ‘Errors in elementary vocabulary’ −.666 T2 −.636 T1 ‘Major errors’ −.708 T2 −.647 B2 T1 ‘Lexical errors’ −.701 T2 −.644 A2 T1 ‘Errors everyday needs vocabulary’ % Error-free AS-units −.689 B1 T1 ‘Errors in elementary vocabulary’ −.679 B1 T1 ‘Major errors’ −.582 B2 T1 ‘Lexical errors’ −.668 Level Task Scale variable Measure Correlation (r) A2 T1 ‘Errors everyday needs vocabulary’ Lexical errors/token .627 T2 .460 B1 T1 ‘Errors in elementary vocabulary’ .934 T2 .982 T1 ‘Major errors’ .914 T2 .743 A2 T1 ‘Errors everyday needs vocabulary’ Lexical Quality Indicator −.543 T2 −.494 B1 T1 ‘Errors in elementary vocabulary’ −.666 T2 −.636 T1 ‘Major errors’ −.708 T2 −.647 B2 T1 ‘Lexical errors’ −.701 T2 −.644 A2 T1 ‘Errors everyday needs vocabulary’ % Error-free AS-units −.689 B1 T1 ‘Errors in elementary vocabulary’ −.679 B1 T1 ‘Major errors’ −.582 B2 T1 ‘Lexical errors’ −.668 Notes: Correlations given if consistently significant (p <.05) for both languages and tasks; values calculated across languages. All scale variables per token (see Table AI). Table 13: Correlations of measures of lexical correctness with CEFR scale variables (vocabulary control) Level Task Scale variable Measure Correlation (r) A2 T1 ‘Errors everyday needs vocabulary’ Lexical errors/token .627 T2 .460 B1 T1 ‘Errors in elementary vocabulary’ .934 T2 .982 T1 ‘Major errors’ .914 T2 .743 A2 T1 ‘Errors everyday needs vocabulary’ Lexical Quality Indicator −.543 T2 −.494 B1 T1 ‘Errors in elementary vocabulary’ −.666 T2 −.636 T1 ‘Major errors’ −.708 T2 −.647 B2 T1 ‘Lexical errors’ −.701 T2 −.644 A2 T1 ‘Errors everyday needs vocabulary’ % Error-free AS-units −.689 B1 T1 ‘Errors in elementary vocabulary’ −.679 B1 T1 ‘Major errors’ −.582 B2 T1 ‘Lexical errors’ −.668 Level Task Scale variable Measure Correlation (r) A2 T1 ‘Errors everyday needs vocabulary’ Lexical errors/token .627 T2 .460 B1 T1 ‘Errors in elementary vocabulary’ .934 T2 .982 T1 ‘Major errors’ .914 T2 .743 A2 T1 ‘Errors everyday needs vocabulary’ Lexical Quality Indicator −.543 T2 −.494 B1 T1 ‘Errors in elementary vocabulary’ −.666 T2 −.636 T1 ‘Major errors’ −.708 T2 −.647 B2 T1 ‘Lexical errors’ −.701 T2 −.644 A2 T1 ‘Errors everyday needs vocabulary’ % Error-free AS-units −.689 B1 T1 ‘Errors in elementary vocabulary’ −.679 B1 T1 ‘Major errors’ −.582 B2 T1 ‘Lexical errors’ −.668 Notes: Correlations given if consistently significant (p <.05) for both languages and tasks; values calculated across languages. All scale variables per token (see Table AI). However, in T-tests and discriminant analyses no measure of lexical accuracy could clearly separate the clusters if taken as a whole. This suggests that single scale variables might have a clearer link to the construct than the combined level description scale variables. A particular difficulty is the vertical positioning of the A2 scale variable. Learners fitting the level description had high values in several lexical (accuracy) measures (Table 14). The A2 scale variable (errors in concrete everyday needs vocabulary/token) was thus relatable to the lexical accuracy construct, but seems to indicate a higher level of ability. Table 14: Research-based measures of vocabulary control in the A2 cluster and in productions clustered higher than A2 Task/ language Measure p-value Effect size (η2) Mean A2 cluster (SD) Mean higher clusters (SD) T1/GER Lexical Quality Indicator .019 .163 69 (21.1) 39 (7.1) % Rare words .05 .420 4.04 (3.49) 3.17 (1.8) T1/ITA % Rare words .048 .235 3.8 (2.12) 1.81 (.21) T2/GER Errors in formulaic sequences/token .042 .205 .003 (.003) .006 (.001) T2/GER Lexical Quality Indicator .037 .368 104.14 (37.74) 4.5 (17.68) T2/ITA Errors in formulaic sequences/token .047 .408 .004 (.002) .01 (.00) Task/ language Measure p-value Effect size (η2) Mean A2 cluster (SD) Mean higher clusters (SD) T1/GER Lexical Quality Indicator .019 .163 69 (21.1) 39 (7.1) % Rare words .05 .420 4.04 (3.49) 3.17 (1.8) T1/ITA % Rare words .048 .235 3.8 (2.12) 1.81 (.21) T2/GER Errors in formulaic sequences/token .042 .205 .003 (.003) .006 (.001) T2/GER Lexical Quality Indicator .037 .368 104.14 (37.74) 4.5 (17.68) T2/ITA Errors in formulaic sequences/token .047 .408 .004 (.002) .01 (.00) Notes: T-test results (significance at 5% level) and effect sizes. Table 14: Research-based measures of vocabulary control in the A2 cluster and in productions clustered higher than A2 Task/ language Measure p-value Effect size (η2) Mean A2 cluster (SD) Mean higher clusters (SD) T1/GER Lexical Quality Indicator .019 .163 69 (21.1) 39 (7.1) % Rare words .05 .420 4.04 (3.49) 3.17 (1.8) T1/ITA % Rare words .048 .235 3.8 (2.12) 1.81 (.21) T2/GER Errors in formulaic sequences/token .042 .205 .003 (.003) .006 (.001) T2/GER Lexical Quality Indicator .037 .368 104.14 (37.74) 4.5 (17.68) T2/ITA Errors in formulaic sequences/token .047 .408 .004 (.002) .01 (.00) Task/ language Measure p-value Effect size (η2) Mean A2 cluster (SD) Mean higher clusters (SD) T1/GER Lexical Quality Indicator .019 .163 69 (21.1) 39 (7.1) % Rare words .05 .420 4.04 (3.49) 3.17 (1.8) T1/ITA % Rare words .048 .235 3.8 (2.12) 1.81 (.21) T2/GER Errors in formulaic sequences/token .042 .205 .003 (.003) .006 (.001) T2/GER Lexical Quality Indicator .037 .368 104.14 (37.74) 4.5 (17.68) T2/ITA Errors in formulaic sequences/token .047 .408 .004 (.002) .01 (.00) Notes: T-test results (significance at 5% level) and effect sizes. There is also the danger of construct-irrelevance. On B2, two scale variables (‘confusions/synforms’; ‘errors that hinder communication’) could not be connected to lexical accuracy. ‘Synforms’, hardly observable, were weakly related to aspects of lexis other than its accuracy, particularly sophistication (T2, correlations: % rare words 0.548; weighted lexical density 0.502; Advanced Guiraud 0.528). The B2 scale variable related to communicative success almost exclusively correlated with comprehensibility of utterances (incomprehensible utterances in basic vocabulary/everyday life: T1: 0.726, T2: 0.768; incomprehensible utterances in general topics T1: 0.989, T2: 0.923). SUMMARY AND DISCUSSION The results show that (i) the observability of operationalized descriptors (‘scale variables’) was limited. Some were not as salient as suggested. Others assumed evenly distributed values among participants, raising questions about their typicality and leading to level assignment dilemmata. Still others were hardly observed at all, or they were speculative. The number of scale variables per level is very low so that a lack of observability can put the usefulness of the whole level description at stake. It was easiest to find correlates for the fluency scale, while the vocabulary range scale was particularly problematic. Additionally, some assumptions underlying the scales (e.g. more lexical errors in more complex topics) were not confirmed. Second, the attempt at (ii) matching learner productions to (the operationalizable parts of) CEFR scales led to consistency problems. Correlations among scale variables were weak, often inconsistent, and sometimes contradictory. Scale variables were not sufficient to describe what learners did, and satisfying clusters could not be found for all level descriptions. Often, data contradicted scale predictions. Scale variable values were observable in many different patterns, so that the level descriptions did not seem to capture typical behavior. In those cases, level descriptions captured what some learners did, while ignoring many others. Many productions did not match any level description, while plenty fitted two or even three of them (Table 8). Again, the fluency scale was least problematic. Furthermore, it was only partly possible to (iii) link scale variables to research-based measures. The fluency scale variables were related to a temporal construct of ‘low order fluency’ (Lennon 2000). The vocabulary control scale contains some more functional scale variables, all related to the number of errors. Thus, analogously, a ‘low order accuracy’ construct is the least problematic—sometimes the only applicable—aspect of this scale. Both construct interpretations contradict the approach to language learning of the CEFR. It was impossible to relate the vocabulary range scale variables to a lexical construct. Other drawbacks regard construct-irrelevance and questionable vertical positioning of single scale variables. No additional information on the complexity of the three constructs is provided for CEFR users (e.g. for fluency: difference between utterance/cognitive/perceived fluency, influence of tasks, automatization and more, Segalowitz 2010)—the CEFR text thus does not help to even out scale inconsistencies. Since the CEFR claims to be valid across languages (Alderson 2007: 660; Fulcher 2008: 167), differences between target languages (Italian/German) should be minimal. While in fact some results varied with the target language, there was a stronger tendency for the task type to make a difference with regard to observability, especially for the vocabulary scales. It seems all but trivial to construct one (or more) productive task(s) that contains language material ratable with the CEFR vocabulary scales (Bachman and Palmer 2010: 351). As pointed out above, this study is not to be misunderstood as a comprehensive CEFR scale validation. The time-consuming methodology made it necessary to focus on a clearly outlined context. This explorative study takes into account language samples elicited in two oral proficiency test tasks by N = 19 learners of Italian and German. Obviously, the language produced in this context does not create a comprehensive picture of the participants’ L2. Also, the study might have yielded different results if the tasks had been designed to speficically get participants to show their best fluency, vocabulary control, and knowledge of vocabulary, or if a non-test setting had been parted from. Still, the language samples can reasonably be believed to represent what is regularly done in proficiency tests, which are a context in which it is particularly desirable for CEFR scale contents to be useful. In the given context, the results imply that it can be very problematic to use CEFR scales for the description of learner language. However, the results cannot be transferred to other contexts of language use without further empirical analyses. It is highly desirable for future validation studies to extend the range of analysed task types. Many studies overarching a wide range of languages and tasks involving longer stretches of written and spoken speech produced in a broad range of contexts are needed in order to address the empirical validity of CEFR scales more fully. This study, exploratory in nature and constrained in size, is meant as a first step into that direction. While its results must not be generalized, it might help to deliver a ‘snapshot’ (North 2014b: 23) of the empirical robustness of selected level descriptions of three CEFR scales in a specific context. Addressing the empirical validity aspects in focus here is believed to be possible only without reliance on human ratings. As a consequence, this study adopts a perspective that links CEFR scales to learner language in a very literal way, as objectively as possible. Not all aspects of the three scales could be operationalized. This procedure would seem a rather artificial thing to do in real-world contexts, but is considered unavoidable for scale validation purposes. CONCLUSION The Common European Framework of Reference has contributed considerably to improving the quality and transparency of language teaching, learning, and assessment in Europe (and beyond). Its learner-oriented, communicative approach to L2 language competence has profoundly influenced the reality of language education. Contrary to the authoring team’s original intentions, the scales are often understood as the ‘core’ of the CEFR. Fulcher et al. (2011: 232) diagnosed a ‘reification’ of the CEFR level system, intending the process of a convention that assumes the alleged character of a hard fact. It is the use that is made of the CEFR scales rather than the instruments themselves that constitutes a major problem. In many situations, the scales are misunderstood and over-interpreted. Often their suitability to describe L2 competence is overestimated. This tendency becomes dangerous if decisions about learners’ lifes are taken on the basis of CEFR scales. Thus, although the strengths and the constraints of the CEFR scales have been repeatedly defined by their authors (e.g. North 2000; North 2014), we are now in a situation where school curricula, educational standards, and language tests are readily related to CEFR levels, whilst many validity aspects of the scales have not yet been examined (North 2014: 44). This explorative study, in an attempt at non-circular validation of empirical robustness aspects, has shown that for the sample under consideration and the context of analysis, there are shortcomings in the empirical validity of the CEFR scales for fluency, vocabulary range and vocabulary control on the levels A2-B2. As mentioned, these results must not be generalized. However, they underline the need for more validation studies based on empirical learner language, for all CEFR scales and levels, for written and spoken L2, and for more languages. Such projects could greatly profit from the creation of large learner corpora (see Wisniewski, forthcoming). Fortunately, the CEFR defines itself open to revisions (CoE 2001: 8) so that it might be possible to integrate validation findings into future scale versions.Until we know more about how shaky the ground beneath the CEFR really is (Hulstijn 2007), however, great care ought to be taken when referring to CEFR levels. Conflict of interest statement. None declared. NOTES Footnotes 1 Wisniewski (2014) contains a study of the theoretical validity of the fluency and both vocabulary scales. For reasons of space, the results cannot be discussed here. 2 A recent initiative aiming at both illustration and validation of CEFR levels is MERLIN (www.merlin-platform.eu) that compiled a trilingual, freely accessible annotated learner corpus (Wisniewski et al. 2013, Abel et al. 2014). 3 The lexical/functional fields were defined in tagsets. All unclear utterances were re-assessed (basic communicative needs/everyday vocabulary/general vocabulary), independently of the determination of type frequency which was also carried through. 4 The CEFR text does not define lexical errors. The operationalization of the scale concept of “error severeness” here rests on the assumptions that errors are more severe if they 1) lead to communication problems and 2) occur in the communication of everyday/survival needs. Coders additionally classified errors according to their appearance in concrete vs abstract and specifc vs general topics. 5 ‘Functional’ scale variables on A2: length/number of pauses (e.g. phonation-time ratio −0.631/−0.795 in T1, mean length of runs T1 −0.906/−0.928, speech rate −0.680/−0.849, in T1, −0.559 in T2, number of long pauses/syllable or minute 0.905/0.766 T1, 0.833/0.731, T2). These scale variables did not significantly correlate with hesitation phenomena or strategies. B1: ‘B1 pauses’ (for lexical/grammatical planning/repair) and number of pauses/syllable (T2: 0.621), speech rate (T2: −0.707) and ‘keep goings’ (continuing speaking after a lexical problem, T1: 0.811, T2: 0.599). B2: pause-related scale variables (number of pauses/syllable and number of long pauses/syllable) consistently related to fluency-based measures (particularly in T1: phonation-time ratio −0.786/−0.782; mean length of runs −0.624/−0.906; speech rate −0.808/−0.849). 6 Repetitions (.56, T2 across languages), lexical planning pauses (.912, ITA/T1 and .659 across languages, T2), and circumlocutions (.669, GER/T1) were correlated to events in which learners kept going after a lexical formulation difficulty (indicators: occurrence of an L1/L3 expression, deadend, circumlocution, lexical error, inappropriate lexical item/generalization plus one or more pauses). Lexical planning pauses were correlated to filled pauses (.817, T1, across languages), while circumlocutions had a correlation with repairs (−0.714, ITA/T2) and false starts (.691, ITA/T2). Phonetic lengthenings were correlated with lexical planning pauses (.763, T2) and circumlocutions (.717, T1) for GER L2. REFERENCES Abel A. , Vettori C. , Wisniewski K. (eds). 2012 . Gli studenti altoatesini e la seconda lingua: indagine linguistica e psicosociale./Die Südtiroler SchülerInnen und die Zweitsprache: eine linguistische und sozialpsychologische Untersuchung , vol. 2 . Eurac . Available at http://www.eurac.edu/de/research/Publications/Pages/publicationdetails.aspx?pubId=0100156&type=Q. Abel A. , Nicolas L. , Wisniewski K. , Boyd K. A. , Hana J. . 2014 . ‘ A trilingual learner corpus illustrating European reference levels ,’ Ricognizioni. Rivista Di Lingue E Letterature E Culture Moderne 2 : 111 – 26 . Available at http://www.ojs.unito.it/index.php/ricognizioni/index. Alderson C. J. 1991 . ‘Bands and scores’ in Alderson J. C. , North B. (eds): Language Testing in the 1990s . British Council and Macmillan . pp. 71 – 86 . Alderson C. J. 2007 . ‘ The CEFR and the need for more research ,’ The Modern Language Journal 91 : 658 – 62 . Google Scholar Crossref Search ADS Alderson J. C. , Kremmel B. 2013 . ‘ Re-examining the content validation of a grammar test: The (im)possibility of distinguishing vocabulary and structural knowledge ,’ Language Testing 30 : 535 – 56 . Google Scholar Crossref Search ADS Arnaud P. J. L. 1984 . ‘The lexical richness of L2 written productions and the validity of vocabulary tests’ in Culhane T. , Klein-Braley C. , Stevenson D. K. (eds): Practice and Problems in Language Testing . Department of Language and Linguistics, University of Essex . pp. 14 – 28 . Bachman L. F. , Palmer A. . 1996 . Language Testing in Practice . Oxford University Press . Bachman L. F. , Palmer A. . 2010 . Language Testing in Practice. Developing Language Assessment and Justifying their Use in the Real World . Oxford University Press . Bartning I. , Martin M. , Vedder I. (eds). 2010 . Communicative Proficiency and Linguistic Development: Intersections between SLA and Language Testing Research. EUROSLA Monograph Series, 1. Available at http://eurosla.org/monographs/EM01/EM01tot.pdf. Chapelle C. A. 1999 . ‘Validity in language assessment ,’ Annual Review of Applied Linguistics 19 : 254 – 72 . Google Scholar Crossref Search ADS Council of Europe . 2001 . (ed.). Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Available at http://www.coe.int/t/dg4/linguistic/cadre1_en.asp. Council of Europe . 2004 . (ed.). Takala, S., F. Kaftandjieva, N. Verhelst, J. Banerjee, T. Eckes, and F. van der Schoot. Reference Supplement to the Preliminary Pilot Version of the Manual for Relating Language Examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Available at: www.coe.int/lang. Council of Europe . 2009 /2003. (ed.). North, B., N. Figueras, S. Takala, P. Van Avermaet and N. Verhelst. Relating Language Examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Manual. Preliminary Pilot Version. Available at: www.coe.int/lang. Cucchiarini C. , Strik H. , Boyes L. . 2000 . ‘‘ Quantitative assessment of second language learners’ fluency by means of automatic speech recognition technology ,’’ Journal of the Acoustical Society of America 107 : 989 – 99 . Google Scholar Crossref Search ADS PubMed De Jong N. H. , Groenhout R. , Schoonen R. , Hulstijn J. H. . 2013 . ‘ Second language fluency: speaking style or proficiency? Correcting measures of second language fluency for first language behavior ,’ Applied Psycholinguistics 36 : 223 – 43 . Google Scholar Crossref Search ADS De Mauro T. (ed.). 2000 . Grande dizionario italiano dell’uso . UTET . De Mauro T. , Mancini F. . 1993 . Lessico di frequenza dell’italiano parlato . Etaslibri . Dörnyei Z. , Scott M. L. . 1997 . ‘ Communication strategies in a second language: definitions and taxonomies. review article ,’ Language Learning 47 : 173 – 210 . Google Scholar Crossref Search ADS Eckes T. 2008 . ‘ Rater types in writing performance assessments: a classification approach to rater variability ,’ Language Testing 25 : 155 – 85 . Google Scholar Crossref Search ADS Foster P. , Tonkyn A. , Wigglesworth G. . 2000 . ‘ Measuring spoken language: a unit for all reasons ,’ Applied Linguistics 21 : 354 – 75 . Google Scholar Crossref Search ADS Fulcher G. 1996 . ‘ Does thick description lead to smart tests? A data-based approach to rating scale construction ,’ Language Testing 13 : 208 – 38 . Google Scholar Crossref Search ADS Fulcher G. 2003 . Testing Second Language Speaking . Longman and Pearson Education . Fulcher G. 2004 . ‘ Deluded by artifices? The common European framework and harmonization ,’ Language Assessment Quarterly an International Journal 1 : 253 – 66 . Google Scholar Crossref Search ADS Fulcher G. 2008 . ‘Criteria for evaluating language quality’ in: Shohamy E. , Hornberger N. H. (eds): Language Testing and Assessment: Encyclopedia of Language and Education , Vol. 7 . Springer . pp. 157 – 76 . Fulcher G. , Davidson F. , Kemp J. . 2011 . ‘ Effective rating scale development for speaking tests: performance decision trees ,’ Language Testing 28 : 5 – 29 . Google Scholar Crossref Search ADS Guiraud P. 1954 . Les caractères statistiques du vocabulaire . Presse Universitaires de France . Harrison J. , Barker F. . 2015 . English Profile in Practice . Cambridge University Press . Harsch C. 2005 . Der Gemeinsame Europäische Referenzrahmen für Sprachen: Leistung und Grenzen. Die Bedeutung des Referenzrahmens im Kontext der Beurteilung von Sprachvermögen am Beispiel des semikreativen Schreibens im DESI-Projekt. Available at http://opus.bibliothek.uni-augsburg.de/opus4/ frontdoor/index/index/docId/297. Hawkins J. A. , Filipovíc L. . 2012 . Criterial Features in L2 English: Specifying the Reference Levels of the Common European Framework . Cambridge University Press . Hilton H. 2008 . ‘ The link between vocabulary knowledge and L2 fluency ,’ Language Learning Journal 36 : 153 – 66 . Google Scholar Crossref Search ADS Hudson Th. 2005 . ‘ Current trends in assessment scales and criterion-referenced language assessment ,’ Annual Review of Applied Linguistics 25 : 205 – 27 . Google Scholar Crossref Search ADS Hulstijn J. H. 2007 . ‘ The shaky ground beneath the CEFR: quantitative and qualitative dimensions of language proficiency ,’ The Modern Language Journal 91 : 663 – 7 . Google Scholar Crossref Search ADS Hulstijn J. H. , Alderson C. , Schoonen R. . 2010 . ‘Developmental stages in second-language acquisition and levels of second-language proficiency: are there links between them?’ in Bartning I. , Martin M. , Vedder I. (eds): Communicative Proficiency and Linguistic Development . pp. 5 – 10 . Jones R. , Tschirner E. . 2006 . A Frequency Dictionary of German: Core Vocabulary for Learners . Routledge . Kane M. T. 2001 . ‘ Current concerns in validity theory ,’ Journal of Educational Measurement 38 : 319 – 42 . Google Scholar Crossref Search ADS Knoch U. 2007 . ‘Do empirically developed rating scales function differently to conventional rating scales for academic writing?,’ in J. S. Johnson (ed). Spaan Fellow Working Papers in Second or Foreign Language Assessment. University of Michigan, 5: pp. 1–36. Knoch U. 2009 . ‘ Diagnostic assessment of writing: a comparison of two rating scales ,’ Language Testing 26 : 275 – 304 . Google Scholar Crossref Search ADS Knoch U. 2011 . ‘ Rating scales for diagnostic assessment of writing: where should the criteria come from? ,’ Assessing Writing 16 : 81 – 96 . Google Scholar Crossref Search ADS Kormos J. 2006 . Speech Production and Second Language Acquisition . Erlbaum . Kuiken F. , Vedder I. (eds). 2014 . ‘ Special issue on assessing oral and written L2 performance: raters’ decisions, rating procedures and rating scales ,’ Language Testing 31 : 279 – 84 . Google Scholar Crossref Search ADS Lennon P. 1991 . ‘ Error. Some problems of definition, identification, and distinction ,’ Applied Linguistics 12 : 180 – 95 . Google Scholar Crossref Search ADS Lennon P. 2000 . ‘The lexical element in spoken second language fluency’ in Riggenbach H. , (ed.): Perspectives on Fluency . Michigan University Press , pp. 25 – 42 . Little D. 2007 . ‘ The common European framework of reference for languages: perspectives on the making of supranational language education policy ,’ The Modern Language Journal 91 : 645 – 55 . Google Scholar Crossref Search ADS Malvern D. , Richards B. , Chipere N. , Durán P. . 2008 . Lexical Diversity and Language Development. Quantification and Assessment . Palgrave Macmillan . McNamara T. , Hill K. , May L. . 2002 . ‘ Discourse and assessment ,’ Annual Review of Applied Linguistics 22 : 221 – 42 . Google Scholar Crossref Search ADS Messick S. 1989 . ‘Validity’ in Linn R. L. , (ed.): Educational Measurement . Macmillan and American Council on Education , pp. 13 – 103 . Nation P. 2001 . Learning Vocabulary in another Language . Cambridge University Press . North B. 1997 . ‘ Perspectives on language proficiency and aspects of competence ,’ Language Teaching 30 : 93 – 100 . Google Scholar Crossref Search ADS North B. 2000 . The Development of a Common Framework Scale of Language Proficiency . Peter Lang . North B. 2007 . ‘ The CEFR illustrative descriptors ,’ The Modern Language Journal 91 : 656 – 9 . Google Scholar Crossref Search ADS North B. 2014a . ‘ Putting the common European framework of reference to good use ,’ Language Teaching 47 : 228 – 49 . Google Scholar Crossref Search ADS North B. 2014b . The CEFR in Practice . Cambridge University Press . O’Loughlin K. 1995 . ‘ Lexical density in candidate output on direct and semi-direct versions of an oral proficiency test ,’ Language Testing 12 : 217 – 37 . Google Scholar Crossref Search ADS Read J. 2000 . Assessing Vocabulary . Cambridge University Press . Read J. 2007 . ‘ Second language vocabulary assessment: current practice and new directions ,’ International Journal of English Studies 7 : 105 – 25 . Read J. , Chapelle C. . 2001 . ‘ A framework for second language vocabulary assessment ,’ Language Testing 18 : 1 – 32 . Google Scholar Crossref Search ADS Rohde L. 1985 . ‘Compensatory fluency: a study of spoken english produced by four Danish learners’ in Glahn E. , Holmen A. (eds): Learner Discourse . University of Copenhagen , pp. 43 – 69 . Schmid H. 1994 . ‘Probabilistic part-of-speech tagging using decision trees’ in Jones D. (ed.): Proceedings of the International Conference on New Methods in Language Processing . University of Manchester , pp. 44 – 9 . Schneider G. , North B. 2000 . Fremdsprachen können—was heißt das? Skalen zur Beschreibung, Beurteilung und Selbsteinschätzung der fremdsprachlichen Kommunikationsfähigkeit . Rüegger . Segalowitz N. 2010 . Cognitive Bases of Second Language Fluency . Routledge . Tschirner E. 2005 . ‘ Das ACTFL OPI und der Europäische Referenzrahmen ,’ Babylonia 2 : 50 – 5 . Tschirner E. 2010 . Grund- und Aufbauwortschatz Italienisch nach Themen . Cornelsen . Wisniewski K. (forthcoming). ‘Empirical learner language and the levels of the common European framework of reference,’ Language Learning. Wisniewski K. 2010 . ‘ Bewertervariabilität im Umgang mit GeRS-Skalen. Ein- und Aussichten aus einem Sprachtestprojekt ,’ Deutsch Als Fremdsprache 3 : 143 – 50 . Wisniewski K. 2012 . Lexikalische Kompetenz in der Fremdsprache testen: Ein Modellierungsversuch , In Abel A. , Vettori C. , Wisniewski K. (eds). pp. 24 – 49 . Wisniewski K. 2013 . ‘The empirical validity of the CEFR fluency scale: the A2 level description’ in Galaczi E. D. , Weir C. (eds): Exploring Language Frameworks: Proceedings of the ALTE Krakow Conference, July 2011 . Cambridge University Press , 253 – 72 . Wisniewski K. 2014 . Die Validität der Skalen des Gemeinsamen europäischen Referenzrahmens für Sprachen. Eine empirische Untersuchung der Flüssigkeits- und Wortschatzskalen des GeRS am Beispiel des Italienischen und des Deutschen . Peter Lang . Wisniewski K. , Schöne K. , Nicolas L. , Vettori C. , Boyd A. , Meurers D. , Hana J. , Abel A. . 2013 . ‘MERLIN: an online trilingual learner corpus empirically grounding the CEFR reference levels in authentic data’ in Conference Proceedings 2013: ICT for Language Learning. Libreriauniversitaria, pp. 12 – 16 . Available at http://conference.pixel-online.net/ICT4LL2013/conferenceproceedings.php Wray A. 2002 . Formulaic Language and the Lexicon . Cambridge University Press . Appendix Table A1: Level descriptions and operationalized scale variables Notes: Descriptors and scale variables derived from the level descriptions (A2-B2) of the three CEFR scales. Table A1: Level descriptions and operationalized scale variables Notes: Descriptors and scale variables derived from the level descriptions (A2-B2) of the three CEFR scales. Table A2: Fluency Measures Fluency aspect Measure Reference (selection) Temporal phenomena Speech rate (syllables/minute) De Jong et al. 2013 Articulation rate (syllables/minute pauses excluded) Cucchiarini et al. (2000) (Exact) phonation-time ratio Segalowitz 2010 Mean length of run (utterances between pauses >250ms) " Pauses/minute and syllable " Mean length of pauses >250 ms Kormos 2006, Segalowitz 2010 Number and length of pauses in different positions (constituent-internal vs clause boundaries) Hilton 2008 Number, mean length, % of all pauses of differently caused pauses CEFR Strategies Message abandonment/token Dörnyei and Scott 1997 for all Circumlocution/token Approximation/token Use of allpurpose words/token Word coinage/token Literal translation/token Foreignizing/token Code-switching/token Verbal strategy marker/token Direct appeal for help/token Use of fillers/token Hesitation phenomena Repetitions/token Kormos 2006 Reformulations/token Fulcher 1996 Self-repairs/token Kormos 2006 False starts/token Rohde 1985 Filled pauses/token Kormos 2006, Segalowitz 2010 Phonetical lengthenings/token Rohde 1985 Keep going/token (keep speaking after a lexical problem) CEFR Fluency aspect Measure Reference (selection) Temporal phenomena Speech rate (syllables/minute) De Jong et al. 2013 Articulation rate (syllables/minute pauses excluded) Cucchiarini et al. (2000) (Exact) phonation-time ratio Segalowitz 2010 Mean length of run (utterances between pauses >250ms) " Pauses/minute and syllable " Mean length of pauses >250 ms Kormos 2006, Segalowitz 2010 Number and length of pauses in different positions (constituent-internal vs clause boundaries) Hilton 2008 Number, mean length, % of all pauses of differently caused pauses CEFR Strategies Message abandonment/token Dörnyei and Scott 1997 for all Circumlocution/token Approximation/token Use of allpurpose words/token Word coinage/token Literal translation/token Foreignizing/token Code-switching/token Verbal strategy marker/token Direct appeal for help/token Use of fillers/token Hesitation phenomena Repetitions/token Kormos 2006 Reformulations/token Fulcher 1996 Self-repairs/token Kormos 2006 False starts/token Rohde 1985 Filled pauses/token Kormos 2006, Segalowitz 2010 Phonetical lengthenings/token Rohde 1985 Keep going/token (keep speaking after a lexical problem) CEFR Table A2: Fluency Measures Fluency aspect Measure Reference (selection) Temporal phenomena Speech rate (syllables/minute) De Jong et al. 2013 Articulation rate (syllables/minute pauses excluded) Cucchiarini et al. (2000) (Exact) phonation-time ratio Segalowitz 2010 Mean length of run (utterances between pauses >250ms) " Pauses/minute and syllable " Mean length of pauses >250 ms Kormos 2006, Segalowitz 2010 Number and length of pauses in different positions (constituent-internal vs clause boundaries) Hilton 2008 Number, mean length, % of all pauses of differently caused pauses CEFR Strategies Message abandonment/token Dörnyei and Scott 1997 for all Circumlocution/token Approximation/token Use of allpurpose words/token Word coinage/token Literal translation/token Foreignizing/token Code-switching/token Verbal strategy marker/token Direct appeal for help/token Use of fillers/token Hesitation phenomena Repetitions/token Kormos 2006 Reformulations/token Fulcher 1996 Self-repairs/token Kormos 2006 False starts/token Rohde 1985 Filled pauses/token Kormos 2006, Segalowitz 2010 Phonetical lengthenings/token Rohde 1985 Keep going/token (keep speaking after a lexical problem) CEFR Fluency aspect Measure Reference (selection) Temporal phenomena Speech rate (syllables/minute) De Jong et al. 2013 Articulation rate (syllables/minute pauses excluded) Cucchiarini et al. (2000) (Exact) phonation-time ratio Segalowitz 2010 Mean length of run (utterances between pauses >250ms) " Pauses/minute and syllable " Mean length of pauses >250 ms Kormos 2006, Segalowitz 2010 Number and length of pauses in different positions (constituent-internal vs clause boundaries) Hilton 2008 Number, mean length, % of all pauses of differently caused pauses CEFR Strategies Message abandonment/token Dörnyei and Scott 1997 for all Circumlocution/token Approximation/token Use of allpurpose words/token Word coinage/token Literal translation/token Foreignizing/token Code-switching/token Verbal strategy marker/token Direct appeal for help/token Use of fillers/token Hesitation phenomena Repetitions/token Kormos 2006 Reformulations/token Fulcher 1996 Self-repairs/token Kormos 2006 False starts/token Rohde 1985 Filled pauses/token Kormos 2006, Segalowitz 2010 Phonetical lengthenings/token Rohde 1985 Keep going/token (keep speaking after a lexical problem) CEFR Table A3: Lexical Measures Indicator Reference (selection) Guiraud‘s Index Guiraud 1954 Advanced Guiraud’s Index Malvern et al. 2008 Lexical density indicator O’Loughlin 1995 Weighted lexical density As above % First k ITA/GER Jones and Tschirner 2006; De Mauro and Mancini 1993; De Mauro 2000, Tschirner 2010 % Rare words (<4,000) As above % Basic vocabulary (>4,000) As above Lexical error annotation Form-related lexical errors Meaning-related lexical Target hypotheses Domain & extent of errors Target language modification L3/L1 influence Systematicity Lennon 1991; Nation 2001, Wisniewski 2012, 2014; Wray 2002 Error ratio 1 (lexical error types/token) Error ratio 2 (lexical error tokens/token) % error-free AS units Foster et al. 2000 Lexical Quality Indicator Arnaud 1984 Lexical errors in 1,000/4,000 most frequent words and rare words (<4,000)/token Indicator Reference (selection) Guiraud‘s Index Guiraud 1954 Advanced Guiraud’s Index Malvern et al. 2008 Lexical density indicator O’Loughlin 1995 Weighted lexical density As above % First k ITA/GER Jones and Tschirner 2006; De Mauro and Mancini 1993; De Mauro 2000, Tschirner 2010 % Rare words (<4,000) As above % Basic vocabulary (>4,000) As above Lexical error annotation Form-related lexical errors Meaning-related lexical Target hypotheses Domain & extent of errors Target language modification L3/L1 influence Systematicity Lennon 1991; Nation 2001, Wisniewski 2012, 2014; Wray 2002 Error ratio 1 (lexical error types/token) Error ratio 2 (lexical error tokens/token) % error-free AS units Foster et al. 2000 Lexical Quality Indicator Arnaud 1984 Lexical errors in 1,000/4,000 most frequent words and rare words (<4,000)/token Table A3: Lexical Measures Indicator Reference (selection) Guiraud‘s Index Guiraud 1954 Advanced Guiraud’s Index Malvern et al. 2008 Lexical density indicator O’Loughlin 1995 Weighted lexical density As above % First k ITA/GER Jones and Tschirner 2006; De Mauro and Mancini 1993; De Mauro 2000, Tschirner 2010 % Rare words (<4,000) As above % Basic vocabulary (>4,000) As above Lexical error annotation Form-related lexical errors Meaning-related lexical Target hypotheses Domain & extent of errors Target language modification L3/L1 influence Systematicity Lennon 1991; Nation 2001, Wisniewski 2012, 2014; Wray 2002 Error ratio 1 (lexical error types/token) Error ratio 2 (lexical error tokens/token) % error-free AS units Foster et al. 2000 Lexical Quality Indicator Arnaud 1984 Lexical errors in 1,000/4,000 most frequent words and rare words (<4,000)/token Indicator Reference (selection) Guiraud‘s Index Guiraud 1954 Advanced Guiraud’s Index Malvern et al. 2008 Lexical density indicator O’Loughlin 1995 Weighted lexical density As above % First k ITA/GER Jones and Tschirner 2006; De Mauro and Mancini 1993; De Mauro 2000, Tschirner 2010 % Rare words (<4,000) As above % Basic vocabulary (>4,000) As above Lexical error annotation Form-related lexical errors Meaning-related lexical Target hypotheses Domain & extent of errors Target language modification L3/L1 influence Systematicity Lennon 1991; Nation 2001, Wisniewski 2012, 2014; Wray 2002 Error ratio 1 (lexical error types/token) Error ratio 2 (lexical error tokens/token) % error-free AS units Foster et al. 2000 Lexical Quality Indicator Arnaud 1984 Lexical errors in 1,000/4,000 most frequent words and rare words (<4,000)/token © Oxford University Press 2017 This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Journal

Applied Linguistics – Oxford University Press

Published: Dec 1, 2018

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

The Empirical Validity of the Common European Framework of Reference Scales. An Exemplary Study for the Vocabulary and Fluency Scales in a Language Testing Context

The Empirical Validity of the Common European Framework of Reference Scales. An Exemplary Study for the Vocabulary and Fluency Scales in a Language Testing Context

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

The Empirical Validity of the Common European Framework of Reference Scales. An Exemplary Study for the Vocabulary and Fluency Scales in a Language Testing Context

The Empirical Validity of the Common European Framework of Reference Scales. An Exemplary Study for the Vocabulary and Fluency Scales in a Language Testing Context

References (54)

Abstract

Journal

Recommended Articles

There are no references for this article.

Our policy towards the use of cookies