Abstract Relying on data from three General Social Survey (GSS) panel studies, conducted in 2006/2008/2010, 2008/2010/2012, and 2010/2012/2014, and building on previous work on this topic, we analyze single survey item reliability in a study of 496 survey questions. Focusing specifically on questions about subjective content, we assess the impact of the number of response categories and rating approach (bipolar versus unipolar) on the reliability of the measurements. Contrary to information theory, our results indicate that the most reliable measurement results when fewer response categories are used, and generally, there is a monotonic decline in levels of reliability with greater numbers of response options. There is a strong suggestion in these data that questions using response formats with middle categories (especially 3-category scales) introduce more measurement errors. These results should be of considerable interest to survey and other researchers who routinely rely on single questions to measure subjective phenomena. 1. INTRODUCTION A great deal has been written about what are desirable attributes of good questions and questionnaires. Several useful contributions have been made to document both the art and science of constructing good survey questions and their evaluation. Improvements in this field have been enhanced most recently by an emphasis on cognitive and comprehension factors in responses to survey questions, as they interact with attributes of the question (see Sudman and Bradburn 1974, 1982; Schuman and Presser 1981; Jabine, Straf, Tanur, and Tourangeau 1984; Converse and Presser 1986; Krosnick and Alwin 1987, 1988, 1989; Saris and Andrews 1991; Tanur 1992; Sudman, Bradburn, and Schwarz 1996; Krosnick and Fabrigar 1997; Sirken, Herrmann, Schechter, Schwarz, Tanur, et al. 
1999; Tourangeau, Rips, and Rasinski 2000; Schaeffer and Presser 2003; Saris and Gallhofer 2007; Madans, Miller, Maitland, and Willis 2011; Schaeffer and Dykema 2011; Krosnick and Presser 2010). In this paper, we consider the issue of the number of response categories for the measurement of subjective variables in surveys as one factor to consider in judging the best qualities of a survey question. We evaluate this with respect to the reliability or precision of measurement using a large number of questions that vary in their number of response options. The consideration of reliability of measurement as a criterion for evaluating survey measures was emphasized in a recent edited collection on question evaluation methods (Madans et al. 2011). These authors observed that in order for survey data to be “viewed as credible, unbiased, and reliable,” it is necessary to develop best practices regarding indicators of quality. They explicitly suggested that in addition to indicators of the quality of sample designs, response rates, and other characteristics of the sample, survey researchers should also focus on the quality of the measurement (Madans et al. 2011, p. 2), invoking the quality standards of reliability and validity, and they urged the implementation of designs that permit their evaluation. Following these considerations, we employ the type of robust research design and a set of statistical models that permit the estimation of the level of measurement error in survey data (see Alwin 2007, 2016). This approach allows us to bring theory and data to bear on a variety of different issues related to the quality of survey measurement. With respect to the optimal number of response categories for the measurement of subjective content in survey research, there are a variety of opinions. One view is that the best approach is to construct composite scales based on responses to simple dichotomous questions (McKennell 1973). 
The idea is that such questions are easy for respondents to understand, they are relatively reliable, and the use of composite scores makes up for whatever lack of precision might result from such coarse categorization of response. Another view, based on information theory, is that questions that utilize a larger number of categories contain greater amounts of information and that more categories result in a higher level of measurement precision (Alwin 1992; Krosnick and Fabrigar 1997). In addition to the number of response categories, there has been a persistent question of whether it is helpful to provide a middle category, (i.e., using three categories rather than two, or using five categories rather than four, and so forth), such that adding more categories involves adding a middle category. We hypothesize that measurement quality may be lower with the inclusion of middle alternatives, at least in some cases. In the case of three versus two categories, respondents may choose a middle category because it involves less effort, and this may provide an option in the face of uncertainty. In other words, three-category scales will be less reliable than two- and four-category scales. This may be somewhat less of a problem in the use of five-category scales because the provision of weak positive and weak negative options may reduce the potential ambiguity between true neutrality and weak positive and negative attitudes. We hypothesize the inclusion of a middle alternative in five-category scales only marginally reduces reliability. 2. THEORY AND BACKGROUND The survey interview involves information communication—in this case, communication about the respondent’s subjective states, namely their attitudes and beliefs. Traditionally, it has been assumed that most respondents have enough information to reliably report on their own attitudes and beliefs, but it is increasingly argued that access to such implicit schemas may be limited (Wittenbrink and Schwarz 2007). 
An attitude is by definition a latent variable, which makes it more difficult to measure. It is an unobserved tendency to behave positively or negatively along a continuum (e.g., to approve or disapprove, to agree or disagree, to favor or oppose, etc.) toward an attitude object (attitude objects can be almost anything: groups, policies, or social actors, about which people have positive or negative feelings, e.g., women, abortion rights, government helping minority groups or the poor, gun control). Beliefs are similarly latent, unobserved assessments of “what is” true or “what is representative” concerning an attitude object (e.g., women are qualified to be president, the government should help the poor, or it is possible to get ahead in the United States.). The type of belief about “what is desirable” is conventionally referred to as a value, a type of non-existential belief, whereas other beliefs are simply statements of “what is” (i.e., existential beliefs). The development of good survey questions to assess beliefs and attitudes depends intimately on how the concepts are defined. Attitudes are generally assumed to have both direction and intensity, the latter often referred to as attitude strength; and this is also the case with beliefs, that is, not only do beliefs have direction, they also have intensity (how much something is true or desirable).1 One approach that has been employed to measure attitudes was introduced by Likert (1932), who suggested that they could be measured relatively easily by presenting respondents with a five-category scale that included the measurement of three elements relating to the attitude concept: the direction of the attitude (agree versus disagree, approval versus disapproval etc.), the strength of the attitude (e.g., agree versus strongly agree and disagree versus strongly disagree), and a neutral point (neither agree nor disagree) for those respondents who could not choose between alternate poles. 
Likert did not suggest offering an explicit “Don’t know” response to distinguish between those people who had no opinion and those who were truly neutral, but this practice has become a well-accepted strategy in modern survey methods and has come to be associated with this approach (see Converse 1964; Alwin 2007). Likert’s approach is recognized as one of the most practical strategies for measurement of attitudes and other subjective variables in surveys. Any investigation into the best way to measure attitudes and other subjective content in surveys must consider this approach. His approach avoided some of the more cumbersome (although perhaps more theoretically elegant) psychophysical scaling techniques introduced earlier by Thurstone (see Thurstone 1927; Jones and Thurstone 1955; Rost 1988), and as Likert argued, his approach gave results very similar to those techniques (see Kish 1982). The “Likert scale” eventually became the textbook approach to measuring attitudes and beliefs in survey research methods (see Schuessler 1971; Moser and Kalton 1972). Such question forms have been adopted throughout the world, and although the term “Likert scale” is often used more broadly to refer to any bipolar survey question (regardless of the number of categories) that attempts to assess the direction and strength of attitudes, it should be noted that Likert did not propose rating scales of more than five categories. In fact, in his initial work (Likert 1932, pp. 15–20), he presented three-point and five-point scales almost exclusively. He referred to these as “three-point statements” and “five-point statements” (Likert 1932, p. 21). We mention Likert’s (1932) approach because it helps provide a framework for discussing issues concerning the optimal number of response categories (and the use of middle categories) to use in the measurement of attitudes and other subjective phenomena. 
In order to distinguish between Likert’s original ideas and the approaches that have introduced modifications, we use the term “Likert scale” for a survey question composed of an ordered five-category measure, using a set of agree-disagree (or approve-disapprove) categories, plus a neutral category. There are several different forms such questions take, and there are subtle differences in how this approach is used. To be clear, here are some examples of five-category Likert questions (not necessarily linked to one another in terms of what they are measuring) from the GSS data (which we employ here), all using several slightly different formats: “Please tell me whether you strongly agree, agree, neither agree nor disagree, disagree, or strongly disagree that … because of past discrimination, employers should make special efforts to hire and promote qualified women.” “Please tell me whether you strongly agree, agree, neither agree nor disagree, disagree, or strongly disagree with the following statement: The way things are in America, people like me and my family have a good chance of improving our standard of living.” “Do you agree or disagree with the following statement: Homosexual couples have the right to marry one another … strongly agree, agree, neither agree nor disagree, disagree, strongly disagree.” “Family life often suffers because men concentrate too much on their work … strongly agree, agree, neither agree nor disagree, disagree, strongly disagree.” “Do you agree or disagree that differences in income in America are too large. Would you say you strongly agree, agree, neither agree nor disagree, disagree, strongly disagree.” All of these questions employ a five-category bipolar (agree-disagree) format with a neutral category (neither agree nor disagree). 
Most approaches to measuring an attitude conceptualize the response scale as representing such an underlying attitude continuum, from positive to negative, and the response categories employed are intended to correspond to points along this attitude continuum. Thus, in the measurement of attitudes, as in the Likert approach (above), researchers often attempt to obtain an assessment of direction and intensity simultaneously (including the neutral position). Two- and three-category scales cannot assess intensity on that continuum, but they can assess direction, and in the latter case, a “neutral” point can be identified. In many other cases, both direction and intensity can be obtained from the typical approaches to attitude measurement, but the argument made here is that the choice of response options should be governed by the purpose of measurement (see Alwin 1992). If one wants to measure both direction of the attitude and its intensity, the survey response continuum should reflect this, and so too, if the interest is only in the direction of the attitude, this should dictate the nature of the question. Not all forced-choice rating-type questions reflect the measurement of a bipolar concept—some are clearly unipolar in character. Such unipolar scales include a “zero point” at one end of the scale rather than a “neutral” point. Thus, unipolar evaluations typically assess questions of “how many?” “how much?” or “how often?” where the zero-point is associated with categories such as “none,” “not at all,” and “never.” To be clear, unipolar measures are neither Likert scales nor Likert-type scales. In other words, Likert’s approach (and modifications of it) was intended strictly for measuring bipolar concepts. 
In further consideration of the distinction between bipolar and unipolar response formats, we should note that, whereas some types of content, such as attitudes, are almost always measured using bipolar scales, others, such as those illustrated below, are more typically measured using unipolar scales. Some constructs, however, can be measured either way. For example, scales that range from “completely satisfied” to “not at all satisfied” are considered to be unipolar. By contrast, a scale with endpoints of “completely satisfied” and “completely dissatisfied” is considered to be bipolar and has an explicit or implicit neutral point. The choice between them should in theory be based largely on what concept the investigator is interested in measuring (a bipolar one or a unipolar one), which is rooted in the theoretical construct of interest. The key distinction we make here between “unipolar” and “bipolar” rating approaches refers not to the form of the question so much as to the form of the content being measured. In the case of the unipolar scale, the question addressed is typically “how much” of something, “how likely” for something to happen, “how satisfied” the respondent is with something, etc., as in the following examples: “There’s been a lot of discussion about the way morals and attitudes about sex are changing in this country. If a man and a woman have sex relations before marriage, do you think it is always wrong, almost always wrong, wrong only sometimes, or not wrong at all?” “Thinking about the next twelve months, how likely do you think it is that you will lose your job or be laid off: very likely, fairly likely, not too likely, or not at all likely?” These are examples where the measurement is in one direction only, assessing the amount of something, e.g., how likely something is, or how wrong something is (and in this case, not how right something is).
In all cases, the underlying concept being assessed does not have bi-directionality, and this is reflected in the measure. Past research indicates that in general unipolar response scales have somewhat higher reliabilities than bipolar rating scales (see Alwin 2007, pp. 191–4). We suspect that this is because unipolar response scales are used when the researcher wants information on a single unitary concept (e.g., the frequency or amount of some quantity), whereas bipolar questions require the respondent to handle several concepts simultaneously (namely neutrality, direction, and intensity).

There is a literature that has focused on the optimal number of response categories (see Maitland 2009). Results are somewhat mixed, although there is strong support for the hypothesis that more is better. Using a multitrait-multimethod (MTMM) design and measuring the degree of satisfaction with various domains of life, using bipolar scales of seven and eleven categories, Alwin (1997, 2007) concluded that the information theoretic hypothesis was supported and rejected the idea that questions with greater numbers of response categories are more vulnerable to systematic measurement errors due to response scale method effects.2 This pattern is inconsistent with the results of a recent MTMM study of the number of categories used for attitude measurement in the European Social Survey (ESS). Revilla, Saris and Krosnick (2014) used an elaborate split-ballot MTMM comparison of 5-, 7- and 11-response category measures of attitudes using face-to-face interviews in twenty-three of the twenty-five countries in the ESS. They found the most support for the quality of five-category response scales, thereby rejecting the claims of the information-theoretic argument applied to questions with greater numbers of response categories.
Further research is necessary that controls for the content of the questions—for measures of the concept of life satisfaction it appears that more is better, but the same conclusion may not hold for the measurement of attitudes. These conclusions may not apply to the measurement of unipolar concepts (see below), since the measurement of such concepts using closed-form questions rarely involves more than five categories. Evaluating the differences in reliability of measurement across categories of different lengths in Alwin’s (2007) analysis of more than 340 measures of subjective content revealed the superiority of four- and five-category scales for unipolar concepts, a finding that supports the kind of “information theoretic” logic mentioned earlier. For measures of bipolar concepts, the two-category scale continues to show the highest levels of measurement reliability. There are many fewer substantive differences among the bipolar measures, although it is relatively clear that longer scales have slightly lower reliabilities (Alwin 2007, pp. 193–4). The issue of the number of response categories includes a subtheme of the inclusion of “middle categories” in the measurement of attitudes. In brief, this literature suggests that three-category questions are less reliable than two- and four-category questions, in part, we suspect, because the introduction of a middle alternative presents room for ambiguity between positive and negative options (Schuman and Presser 1981). Two types of ambiguity exist. First, at the level of the latent attitude, the individual’s region of neutrality may vary and in some cases may be quite wide. In other words, neutrality is not necessarily a discrete point on a latent attitude continuum. It may be a region between acceptability and unacceptability of the attitude object, and there may be considerable ambiguity about the difference between neutrality and weak positive and negative attitudes (see Alwin and Krosnick 1991; Alwin 1992, pp. 
93–4). The respondent may resolve this ambiguity essentially by chance; thus, three-category scales may introduce more randomness in responses. Second, as others have observed, the response scale at the manifest level, may produce certain ambiguities. Even if the internal attitude is clear, the respondent may find it difficult to translate the attitude into the language used in the response categories, and in some cases, use the middle category as a way of saying “I don’t know” (see Tourangeau, Rips, and Rasinski 2000; Sturgis, Roberts, and Smith 2014). Research has also shown that middle alternatives are more often chosen when they are explicitly offered than when they are not (Schuman and Presser 1981), suggesting that the meaning of the response categories may stimulate the response. When it is offered the respondent may choose a middle category because it requires less effort and may provide an option in the face of uncertainty (Krosnick and Alwin 1989; Alwin 1991). Because of the strong potential for these ambiguities in the response process, we hypothesize that three-category scales are less reliable than two- and four-category scales. On the other hand, five-category scales do not create the same problems because at the manifest level they provide weak positive and weak negative categories, thereby giving the respondent a better opportunity to distinguish between true neutrality and weak forms of positive and negative attitudes. 3. METHODS The present study is concerned with the relationship between the accuracy/consistency of measurement and the nature of response categories used in the measurement of attitudes (and/or subjective phenomena generally) in contemporary surveys. In this section we provide a discussion of (1) the requirements of our study design, (2) the data we employ, and (3) our analytic approach to estimating reliability. 
3.1 Study Design

Our study design requires the use of large-scale panel studies that are representative of known populations, with a minimum of three waves of measurement separated by two-year re-interview intervals. Questions were selected for use if they were exactly replicated (exact wording, response categories, mode of interviewing, etc.) across the three waves, and if the underlying variable measured was continuous (rather than categorical) in nature. One of the main advantages of the re-interview or panel design using long (two-year) re-interview intervals is that, under appropriate circumstances, it is possible to eliminate the confounding of the systematic and random error components. In the panel design, by definition, measurement is repeated, so memory and other systematic sources of error must be ruled out. Thus, while this design overcomes one limitation of cross-sectional surveys, namely the failure to meet the assumption of the independence of errors, it presents problems if respondents can remember what they said in a previous interview and are motivated to provide consistent responses (Moser and Kalton 1972). Given the difficulty of estimating memory functions, estimation of reliability from re-interview designs makes sense only if one can rule out memory as a factor in the covariance of measures over time, and thus, the occasions of measurement must be separated by sufficient periods of time to rule out the operation of memory. In cases where the re-measurement interval is not long enough to permit appropriate estimation of the reliability of the data, the estimate of reliability will most likely be inflated (see Alwin 1989, 1992; Alwin and Krosnick 1991). The results of these studies suggest that longer re-measurement intervals, such as those employed here, are highly desirable.

3.2 Data and Samples

The principal focus of the present report is on the General Social Survey (GSS).
Data were collected on representative samples of the US population, with re-interviews implemented at two-year intervals (Smith et al. 2015). The GSS is a face-to-face interview measuring attitudinal and demographic change of a representative sample of households in the United States. It is funded by the National Science Foundation (NSF) and conducted by NORC at the University of Chicago. The GSS collected data annually between 1972 and 1994 (with the exception of 1979, 1981, and 1992) and biennially since 1994 (Smith et al. 2015). In 2006, the GSS began to implement a rolling panel design, including three waves of data collection, with re-interviews occurring two years following the first, and the final interview occurring four years after the initial interview. These data can be used as either panel data or cross-sectional data (see General Social Survey 2011, 2013, 2015).

In 2006, 4,510 respondents were interviewed. A subsample of 2,000 respondents was selected for re-interview in 2008, with 1,536 respondents participating. In the final wave of data collection for this panel (occurring in 2010), there were 1,276 respondents. The panel that began in 2008 included 2,023 respondents and attempted to re-interview all respondents in wave two (2010) and wave three (2012), with 1,581 and 1,295 respondents participating, respectively. The third panel began in 2010 and included 2,044 respondents in the first wave. In the second wave (2012), 1,551 respondents participated, and the final wave (2014) included 1,304 respondents (see Hout and Hastings 2016).

3.3 Measures

Since the inception of the GSS, each interview has included a set of core demographic, behavioral, and attitudinal measures, a number of which have remained unchanged over time. These measures are part of the GSS “replicating core” and fall into two categories: socio-demographic/background measures and social and political attitudes and behaviors.
Socio-demographic and background measures include life course data; work/employment data; and spousal, household, and parental socioeconomic data. Social/political attitudinal and behavioral measures include, but are not limited to, religious and political attitudes and behaviors, and attitudes about suicide, crime, gender, race, family, sexual behaviors, and vocabulary knowledge. Although “core” measures have been part of the GSS since its inception, some measures have been discontinued, and new measures added over time. Additionally, some items have been repeated in multiple waves based on agreements with other agencies, including modules from the International Social Survey Program (ISSP).

In this paper, we present reliability estimates for 496 questions from the GSS panel surveys focusing on subjective (i.e., non-factual) questions. Table A1 in the Online Supplemental Materials lists, by GSS mnemonic, all the GSS questions used in this analysis. The actual wordings of these questions can be found in the GSS materials (codebooks, questionnaires, etc.) available online. The GSS questions listed in Table A1 (in the Online Supplemental Materials) are presented according to the number of categories used, specifically the two-, three-, four-, and five-category “Likert” and “Likert-type” questions and the four-category unipolar questions, as well as all other questions used from the GSS data. Table A2 in the Online Supplemental Materials presents results of the reliability estimates, using both listwise and full-information approaches to missing data.

3.4 Analysis

With three waves of panel data, and if certain other assumptions are met, one can use the three-wave quasi-simplex model (see Heise 1969; Jöreskog 1970; Wiley and Wiley 1970; Wiggins 1973; Alwin 2007) to obtain estimates of the reliability of measurement for individual survey questions. The focus is on the single survey question rather than on composite measures.
This approach also rejects the simple two-wave test-retest approach by employing three waves of data, rather than two, which permits the model to account for both unreliability and true change in latent variable of interest. This three-wave panel design has been very successful in separating unreliability from true change, allowing the evaluation of the extent of random measurement error in survey measures (e.g., see Bohrnstedt, Mohler, and Müller 1987; Alwin and Krosnick 1991; Alwin 2007, 2010, 2011, 2015, 2016; Alwin, Zeiser, and Gensimore 2014; Alwin and Beattie 2016). As noted, we include survey measures of continuous variables only, and within this class of variables, we implement estimates of reliability that are independent of scale properties of the observed measures, which may be dichotomous, polytomous-ordinal, or interval. In each of these cases, the analysis employs a different estimate of the covariance structure of the observed data, but the model for reliability is the same. That is, when the variables are dichotomies, the appropriate covariance structure used in reliability estimation is based on tetrachoric correlations (Jöreskog 1990, 1994; Muthén 1984); when the variables are polytomous-ordinal, the appropriate covariance structure is either the polychoric correlation matrix or the asymptotic covariance matrix based on polychoric correlations; and when the variables can be assumed to be interval, ordinary Pearson-based correlations and covariance structures for the observed data are used (Muthén 1984; Brown 1989; Jöreskog 1990, 1994; Lee, Poon, and Bentler 1990). As noted, all of these models assume that the latent variable is continuous. The psychometric definition of reliability on which we rely here also requires that the estimate of error variance be independent of any true change in the quantity being measured (Lord and Novick 1968). 
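To make concrete how three waves allow unreliability to be separated from true change, the following is a minimal sketch of the simplest just-identified version of the calculation, in the spirit of Heise (1969). It assumes constant reliability across waves and lag-1 change in the true score, so that the three observed over-time correlations determine one reliability and two stability coefficients. It is an illustration only and not the estimation procedure used in this paper, which relies on structural equation software and, for ordinal items, polychoric correlations. The function name and the input correlations are hypothetical.

```python
def heise_three_wave(r12, r23, r13):
    """Illustrative three-wave reliability/stability calculation.

    Assumes (as a simplification) equal reliability at all waves and
    lag-1 true-score change, so that
        r12 = rel * s12,  r23 = rel * s23,  r13 = rel * s12 * s23.
    Inputs are correlations between observed scores at waves 1-2, 2-3, 1-3.
    """
    if min(r12, r23, r13) <= 0:
        raise ValueError("this sketch requires positive correlations")
    reliability = r12 * r23 / r13   # solves the three equations for rel
    stability_12 = r13 / r23        # true-score correlation, wave 1 to 2
    stability_23 = r13 / r12        # true-score correlation, wave 2 to 3
    return reliability, stability_12, stability_23


# Hypothetical correlations chosen for illustration (not GSS estimates).
rel, s12, s23 = heise_three_wave(r12=0.56, r23=0.56, r13=0.49)
print(f"reliability={rel:.3f}, stability 1-2={s12:.3f}, stability 2-3={s23:.3f}")
```

With only two waves, a single test-retest correlation would confound the reliability with the stability of the true score; the third wave supplies the extra correlation needed to solve for them separately.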
The model we employ falls into a class of auto-regressive or quasi-Markov simplex models that specifies two structural equations for a set of $p$ over-time measures of a given variable $y_t$ (where $t = 1, 2, \ldots, p$) as follows (see the path diagram for this model in Alwin 2007, figure 5.2 and pp. 102–116):

$$y_t = \tau_t + \varepsilon_t \tag{1}$$

$$\tau_t = \beta_{t,t-1}\,\tau_{t-1} + \zeta_t \tag{2}$$

The first equation represents a set of measurement assumptions indicating that (1) over-time measures are assumed to be $\tau$-equivalent, except for true score change and (2) measurement error is random (see Heise 1969; Jöreskog 1970; Wiley and Wiley 1970; Alwin 1989, 2007, 2011). In this first equation, $y_t$ is the observed score at time $t$, $\tau_t$ is the latent unobserved true score at time $t$, and $\varepsilon_t$ is the latent unobserved random measurement error. The second equation specifies the causal processes involved in change of the latent variable over time. Here, the latent true score at a given time, $\tau_t$, is a function of the true score at the previous time, $\tau_{t-1}$, and a random disturbance term, $\zeta_t$. It is important to note that this model assumes the latent variable will change over time and follows a lag-1 or Markovian process in which the distribution of the true variables at time $t$ is dependent only on the distribution at time $t-1$ and not directly dependent on distributions of the variable at earlier times. If these assumptions do not hold, then this type of simplex model may not be appropriate.

This is not an appropriate place to debate the usefulness of this model; however, we should note that the three-wave simplex model may have limitations in some cases. First, it makes the assumption that reliabilities are essentially constant over time. And second, it assumes that lagged effects of latent variables are non-existent (i.e., the Markovian assumption). For most survey questions, the assumption of time-invariant reliabilities does not appear to be much of a problem.
This issue has been thoroughly investigated, and there are only trivial differences between approaches to handling it (see Alwin 2007). The second potential problem is the lag-1 feature of the model, i.e., the model does not allow for a direct effect of the latent variable at time-1 on the latent variable at time-3. Unfortunately, this assumption cannot be tested with only three waves. This is not a serious issue if re-interview intervals are of sufficient length to mitigate the problems with the assumption. The GSS design of two-year re-interview intervals helps protect against the effects of memory, although there clearly can be memory effects in such surveys. Additionally, there are certain safeguards in the examination of results from these models, in that instances where the model does not work are typically clear and processes that do not follow a simplex pattern are very rare. Finally, we would note that four or more waves of panel data, even if they existed, would not necessarily solve the problem, in that additional attrition would be a problem, as well as other problems with the replication of measurement across longer periods of time.

A final issue that arises in the use of three-wave quasi-simplex models to estimate the reliability of survey measures is how to handle missing data. As is commonly known, attrition is a perennial problem in the implementation of panel surveys. One approach, used almost exclusively in the monograph mentioned earlier (see Alwin 2007), was the “listwise” approach, that is, using only those cases that had data present in all three waves of the panel.3 An alternative explored in that monograph, however, was full-information maximum-likelihood (FIML), which estimates the correlations (either Pearson or polychoric) using all information present (see Allison 1987). This approach is statistically justified but can be misleading when there is not much data present across waves of the survey.
Therefore, before using such an approach to estimate reliability of measurement, it is important to assess the extent of missing data. One useful indicator to evaluate is the “proportion of data present” across waves. This is a set of percentage figures routinely produced by software such as Mplus, which gives one an idea of how many cases have data across waves of the panel. In the current study, there were few questions where the proportion of data present across waves was below the twenty percent threshold. In the case of the “total sample” or “full-information” approach, we employed full-information maximum-likelihood (FIML) (Allison 1987; Wothke 2000) to handle missing data for continuous variables and weighted least squares mean- and variance-adjusted (WLSMV) estimation (Asparouhov, Bength, and Muthén 2010) for ordinal variables. For both classes of variables, estimates of reliability were obtained using both listwise and full-information approaches. There may be some differences in the reliability estimates in the extreme cases, where little data is present across waves, but the listwise and total sample (FIML or WLSMV) approaches yield sufficiently similar estimates to alleviate any concerns about substantial differences. Consistent with prior studies, results indicate (see table A2) that listwise and WLSMV/FIML estimates were virtually identical, suggesting an MCAR (missing completely at random) pattern to attrition and missing data (see also Alwin 2007; Alwin, Beattie, and Baumgartner 2015). There appears to be a very slight tendency for the listwise estimates to be higher, but this result is not statistically significant. Because these results are virtually identical, for the remainder of the paper we present only one set of estimates in our analyses, specifically the estimates based on listwise data. 4.
RESULTS In this section we present the results of several comparisons involving questions with differing numbers of response categories. Table 1 presents descriptive information on the pool of measures we employ from the GSS panels. The numbers in the body of the table are average reliabilities, with the number of measures on which they are based given in parentheses. All of the GSS questions are exactly replicated across the three panel studies, so we essentially have three replications contained within this study. The present paper focuses solely on non-factual questions, but we include reliability information here on the factual questions employed in the larger project in order to provide the broader context for these results. The present analysis is based on the non-factual questions in the three panel studies (168 from the 2006/2008/2010 panel, 166 from the 2008/2010/2012 panel, and 162 from the 2010/2012/2014 panel), yielding a total of 496 questions (see the numbers given in table 1). As is well known, factual information is gathered with greater reliability in surveys than is non-factual information (Turner and Martin 1984; Alwin 1989, 2007; Alwin, Beattie, and Baumgartner 2015), and this is borne out by the information presented in table 1. In the GSS surveys, the typical question seeking factual information is measured with average reliability of approximately 0.85, which is relatively high. On the other hand, the average reliability for non-factual information gathered in the GSS is in the range 0.66 to 0.68. The comparisons between the estimated reliabilities of factual versus non-factual questions result in highly significant differences across all three panel studies (see Alwin, Beattie, and Baumgartner 2015). Table 1.
Comparison of Reliability Estimates for Measures of Facts and Non-Facts Measured in three GSS Panel Studies

| Panel Study | 2006/2008/2010 | 2008/2010/2012 | 2010/2012/2014 |
|---|---|---|---|
| Facts | 0.844 (33) | 0.853 (30) | 0.861 (29) |
| Non-facts | 0.677 (168) | 0.664 (166) | 0.685 (162) |
| Beliefs | 0.653 (64) | 0.629 (63) | 0.675 (60) |
| Values | 0.709 (37) | 0.688 (37) | 0.691 (37) |
| Attitudes | 0.679 (35) | 0.686 (35) | 0.679 (34) |
| Self-assessments | 0.668 (12) | 0.680 (12) | 0.706 (12) |
| Self-perceptions | 0.752 (15) | 0.749 (14) | 0.773 (14) |
| Expectations | 0.560 (6) | 0.506 (6) | 0.560 (6) |
| Total | 0.704 (201) | 0.693 (196) | 0.712 (191) |
| Comparisons | | | |
| Facts vs Non-facts: F-ratio | 39.500 | 41.930 | 37.290 |
| Facts vs Non-facts: p-value | 0.000 | 0.000 | 0.000 |
| Content within non-facts: F-ratio | 2.350 | 3.400 | 2.010 |
| Content within non-facts: p-value | 0.044 | 0.006 | 0.081 |

Table 1 also presents information pertaining to variation in reliability among non-facts by the type of content being measured. The numbers in parentheses in table 1 refer to the number of questions that fall within a particular category. These results show that there are marginally significant differences among types of non-factual content across the three panel studies, wherein self-perceptions are measured with about 0.75 reliability, and expectations measured with about 0.55 reliability.
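As a reading aid for table 1, the following sketch checks that the "Total" row is what one would expect if it were the average over all measures in a panel, that is, the mean of the fact and non-fact averages weighted by their numbers of measures. The values are copied from table 1; the variable and function names are illustrative, and the weighted-average reading of the "Total" row is an assumption rather than something stated in the text.

```python
# Average reliabilities and numbers of measures copied from table 1.
TABLE_1 = {
    "2006/2008/2010": {"facts": (0.844, 33), "non_facts": (0.677, 168), "total": 0.704},
    "2008/2010/2012": {"facts": (0.853, 30), "non_facts": (0.664, 166), "total": 0.693},
    "2010/2012/2014": {"facts": (0.861, 29), "non_facts": (0.685, 162), "total": 0.712},
}


def pooled_mean(groups):
    """Mean of group averages weighted by the number of measures per group."""
    total_n = sum(n for _, n in groups)
    return sum(mean * n for mean, n in groups) / total_n


for panel, row in TABLE_1.items():
    pooled = pooled_mean([row["facts"], row["non_facts"]])
    print(panel, round(pooled, 3), "reported:", row["total"])
```

Because the inputs are themselves rounded to three decimals, agreement should be judged to within about 0.001 rather than exactly.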
Here, we focus on the contribution of the number of response categories to reliability of measurement, controlling for content where it is justified (see Table A2 in Online Supplemental Materials). Table 2 presents reliability estimates on an array of different types of four- and five-category questions, which permit us to evaluate the reliabilities of Likert and Likert-type questions. These results permit several conclusions. First, there are significant differences between questions involving unipolar content (e.g., how much, how likely, how satisfied, anchored by the category of “none” or “not at all”), as compared to bipolar questions aimed at measuring both direction and intensity, and in the case of five-category questions, a neutral point. Overall, bipolar questions that include a neutral, or middle, category are no more or less reliable than are four-category questions. Finally, these results reveal no significant differences in measurement reliability among types of bipolar questions, that is, Likert measures versus Likert-type measures. There is a slight tendency for Likert measures to have higher reliabilities among four-category questions, but this trend is not apparent among the five-category measures. The largest difference revealed in this table is not in the patterns found among the bipolar measures, but between the unipolar and bipolar questions. Table 2.
Comparison of Reliability Estimates for Unipolar and Bipolar Likert and Likert-Type Closed-Form Non-Fact Measures Using Four- and Five-Category Response Options

| Number of response categories | Unipolar | Bipolar Likert | Bipolar Likert-type | Total bipolar |
|---|---|---|---|---|
| 4 categories | 0.741 (30) | 0.617 (20) | 0.530 (12) | 0.585 (32) |
| 5 categories | n/a | 0.577 (33) | 0.560 (27) | 0.569 (60) |
| Total | 0.741 (30) | 0.592 (53) | 0.551 (39) | 0.575 (92) |

| Comparison | F-ratio | p-value |
|---|---|---|
| Unipolar vs bipolar (4 categories) | 30.010 | 0.000 |
| Bipolar 4 vs 5 categories: Likert | 1.630 | 0.208 |
| Bipolar 4 vs 5 categories: Likert-type | 0.450 | 0.505 |
| Bipolar 4 vs 5 categories: total bipolar | 0.340 | 0.564 |
| Likert vs Likert-type (4 categories) | 5.430 | 0.027 |
| Likert vs Likert-type (5 categories) | 0.300 | 0.584 |

Table 3 presents results comparing reliability estimates for unipolar and bipolar questions that use two or three response categories. The unipolar versus bipolar difference is not revealed in the comparisons within two- and three-category questions, as shown in table 3. Here, however, there does appear to be an important difference between two- and three-category questions within unipolar and bipolar content. The typical non-factual question involving two response categories enjoys a level of reliability almost as high as that involving factual content.
Across both unipolar and bipolar content assessments, the estimated reliability for two-category questions is about 0.78, whereas for three-category questions the reliability is significantly lower, in the range of about 0.64, regardless of whether unipolar or bipolar content is being measured. As seen in table 3, this is a highly significant result.

Table 3. Comparison of Reliability Estimates for Unipolar and Bipolar Closed-Form Non-Fact Measures Using Two- and Three-Category Response Options

| Number of response categories | Unipolar | Bipolar | Total | Unipolar vs bipolar: F-ratio | Unipolar vs bipolar: p-value |
|---|---|---|---|---|---|
| 2 categories | 0.775 (132) | 0.788 (35) | 0.778 (167) | 0.400 | 0.526 |
| 3 categories | 0.624 (60) | 0.658 (93) | 0.644 (153) | 3.130 | 0.079 |
| Total | 0.727 (192) | 0.693 (128) | 0.716 (326) | | |
| 2 vs 3 categories: F-ratio | 71.360 | 35.570 | 117.770 | | |
| 2 vs 3 categories: p-value | 0.000 | 0.000 | 0.000 | | |

We present the above results graphically by type of non-factual content (these graphic displays are completely consistent with the numbers presented in the tables). Figure 1 presents these patterns for measures of attitudes, beliefs, values, and expectations (ABVE), whereas figure 2 presents the results for measures of self-perceptions and self-assessments (SPSA). These figures exclude the reliability estimates for six-, seven-, and nine-category questions (see table 1 in appendix), because they are few in number and not representative of questions in those classes. Although the patterns described above are present across both types of content, the results appear to be much stronger within the ABVE category. That is, as shown in figure 1, for both unipolar and bipolar content, there is a substantial difference in the reliabilities of two- and three-category measures, but thereafter, the reliability improves with increases in the number of categories among unipolar measures, but decreases among bipolar measures. With a few exceptions, the same pattern is apparent among the SPSA group of questions.

Figure 1.
Chart for Average Reliability of Measures of Attitudes, Beliefs, and Values by Polarity and Number of Response Categories.

Figure 2. Chart for Average Reliability of Measures of Self-Perceptions and Self-Assessments by Polarity and Number of Response Categories.

Finally, we present results comparing the use of two versus three categories among questions in which the respondent was presented with only two categories, but the interviewers were instructed to record “middle category” responses that were volunteered. The results of these comparisons are summarized by the graph in figure 3. This graph summarizes the reliability estimates for nine two-category questions for which the GSS includes the volunteered middle category in the publicly released data set. The full set of results for these nine questions is given in table 2 in appendix. The summary of the results in figure 3 strongly suggests that the practice of recording volunteered middle category responses when they are not offered as part of the question is detrimental to the reliability of responses. Including the volunteered middle category produces reliabilities of less than 0.7, whereas the treatment of the volunteered middle category as missing data produces reliabilities in excess of 0.8. These results are replicated across the two different approaches to the treatment of missing data. These patterns discourage the use of such volunteered middle category responses in analyses of these data.

Figure 3. Chart for Average Reliability Estimates for Two-Category Non-Fact Questions with and without Volunteered Middle Category Responses by Type of Approach to Missing Data.

5. DISCUSSION AND CONCLUSIONS

We began this paper by recalling the approach advocated by Likert (1932) to the measurement of attitudes and other subjective phenomena, which employed a strategy of measurement that assessed direction, intensity, and neutrality of attitudes. Likert’s approach to measuring attitudes and beliefs is recognized as one of the most practical strategies for measurement and is one of the most accepted and taken-for-granted approaches to the measurement of the direction of attitudes and beliefs (agree versus disagree), as well as the degree of intensity, or strength, in either direction. Within this framework, this paper attempted to clarify the nature of the standard approaches to measuring attitudes and to show how they are used in the measurement of attitudes, beliefs, and other subjective states, using data from the General Social Survey panel studies. Our analysis described above examined various forms of questions used in these surveys, evaluating their performance against the criterion of consistency (i.e., reliability) of measurement, using data on 496 subjective measures from three separate three-wave panel studies conducted by the GSS over the past decade. In this paper, we contrasted several approaches to the measurement of attitudes in surveys. One of the important findings presented here is that limiting the measurement of attitudes to unipolar question forms rather than the traditional bipolar approach advocated by Likert may be superior with regard to reliability of measurement.
Our results indicate that, consistent with past research on this question (see Alwin 2007, pp. 191–194), unipolar response scales have substantially higher reliabilities than bipolar rating scales. We argued that this is in part due to the fact that the task is simpler and respondents are being asked to do just one thing rather than multiple things. In other words, unipolar response options are employed when the researcher wants information on a single unitary concept (e.g., the amount or frequency of a particular quantity), in contrast to the measurement of bipolar concepts, such as with Likert’s approach, where the respondent is asked to simultaneously consider the direction, intensity, and neutrality of his/her attitudes. This reinforces the conclusion, as argued by Alwin (1992, 2007), that explicit consideration of the purpose of measurement should be given in the choice of response scales. If all one is interested in is the intensity of the attitude, then simpler forms of measurement may be optimal in terms of reliability. By contrast, if one wishes to measure direction only, the two-category response form may be the most practical. Unless the interest in assessing both direction and strength can be theoretically justified, simpler forms of measurement may be desirable. On the other hand, the combination of approaches may be the most defensible, but we have no evidence to support this here.

With the possible exception of some unipolar scales, there does not appear to be any support for the idea that more categories are better, which is what would be predicted from information theory (see Alwin 1992). In the present results, the only case in which reliability of measurement increases monotonically with larger numbers of response categories is in the case of unipolar measures of attitudes. In all other cases, reliability declines as the number of response options increases, which is supportive of related research (Revilla, Saris, and Krosnick 2014).
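The contrast with information theory can be made concrete. A response scale with k categories can carry at most log2(k) bits per answer, so on that reasoning adding categories should never reduce the quality of measurement. The sketch below sets that upper bound beside the average reliabilities reported in tables 2 and 3, pooled over content types. The helper name is an illustrative assumption, and the check covers only the two- to five-category averages shown in those tables.

```python
import math

# Average reliabilities by number of response categories (tables 2 and 3).
BIPOLAR = {2: 0.788, 3: 0.658, 4: 0.585, 5: 0.569}
UNIPOLAR = {2: 0.775, 3: 0.624, 4: 0.741}


def is_strictly_decreasing(values):
    """True if every value is smaller than the one before it."""
    return all(a > b for a, b in zip(values, values[1:]))


for k in sorted(BIPOLAR):
    max_bits = math.log2(k)  # upper bound on information per response
    print(k, round(max_bits, 2), BIPOLAR[k], UNIPOLAR.get(k))

bipolar_declines = is_strictly_decreasing([BIPOLAR[k] for k in sorted(BIPOLAR)])
unipolar_declines = is_strictly_decreasing([UNIPOLAR[k] for k in sorted(UNIPOLAR)])
print(bipolar_declines, unipolar_declines)
```

In these pooled averages the bipolar series declines at every step while the information bound rises, and the unipolar series drops from two to three categories and then recovers at four.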
It is difficult to generalize these findings given that there are other factors clearly at work. For example, many of the seven-category scales employed in the GSS are not fully labelled, and the use of partially labelled scales is linked to lower levels of reliability (see Alwin 2007, pp. 200–201; Alwin and Krosnick 1991). Perhaps the best illustration of this is the comparison of the two- and three-category measures, wherein the latter have substantially lower reliabilities. Our argument has been that the middle category in this situation introduces ambiguity into what otherwise would be a clear-cut choice. Respondents may choose a middle category in this situation because it involves less effort and may provide an option in the face of uncertainty. Consequently, three-category scales are less reliable than two- and four-category scales. Our results in the present study produce strong support for this hypothesis. Further, we found that the problem is somewhat mitigated in the use of five-category scales, which do not create the same kinds of problems; however, in our results five-category scales are no more reliable than four-category versions that omit the middle category. We expected four-category bipolar scales to be more reliable than five-category bipolar scales (a hypothesis that was supported), although the result was not statistically significant. In theory, the provision of weak positive and weak negative options may reduce the ambiguity between true neutrality and weak forms of positive and negative attitudes, but in terms of reliability, there is only weak evidence of it here. It is important to note, however, that several factors can affect the reliability of measurements and the generalizability of the results obtained in the study.
In drawing our conclusions, we are mindful of some of the specific factors that may have affected the pattern of results obtained in the study (e.g., many of the seven-point scales in the GSS are not fully labeled, perhaps accounting for their comparatively lower reliabilities). We would note that, while there are several limitations to the present research insofar as estimating variations in the quality of the survey response, our results point toward simplicity in design. At the same time, there may be other factors at work. In particular, in order to focus on the response categories, we have chosen to ignore other aspects of survey questions that may be relevant to reliability of measurement. The omission of several known factors, such as the content being measured, the context in which they are located (e.g., with batteries), the labelling of response categories, or the length of the question and introductions to unit organizations, may limit the generality of the findings. Also, the questions from which we generalize are limited to those used in the GSS. Although these surveys span a wide range of content, there may be unique features of the organization collecting the data, specifically NORC, which may affect the level of measurement errors. In this case, given the stature of NORC as a data collection agency and the highly professional quality of NORC surveys, it is difficult to fault the results on quality grounds, but we must nonetheless caution that these reflect potential limitations in the generalizability of the results. Another limitation of this research is that, by focusing here solely on the polarity, the number of response categories, and middle alternatives in the case of bipolar questions, we did not control for other features of the “package” we consider to reflect the survey question (i.e., the content of the question, the response categories, the labelling of categories, and formal attributes of question). 
Although it is our ultimate objective to control for as many of these additional factors as we can, the scope of the present research was limited to response categories. We have tried to control for survey content by analyzing the relationships within categories of content (ABVE and SPSA). Our hypothesis was that measurement errors may be less prevalent in self-reports (i.e., self-evaluations and self-descriptions), compared with reports about objects other than the self (i.e., objects of attitudes). Our results indicate that self-reports have the highest reliability, so it is important to consider this, particularly if certain types of measures tend to be used across categories of content. The measurement of expectations, that is, asking people to predict the future, is the most difficult, in terms of assessed reliabilities. It becomes important in this case to take into account the fact that many of our measures of expectations involve continuous probability scales. These and other issues will require further research on other pools of survey questions and comparison with other forms of analysis. Ultimately, we will be able to improve the quality of survey data by understanding the nature of the elements of the survey process that contribute to measurement error.

Our results indicate that, contrary to information theory and findings in different contexts, the most reliable measurement results when fewer response categories are used, and generally, there is a monotonic decline in levels of reliability when greater numbers of response options are employed. There are important exceptions to this pattern, in that the use of middle categories results in more measurement error, especially in three-category measures. The overall conclusions of the research support the use of fewer response categories rather than more, encourage the use of unipolar scales, and discourage the use of middle categories in (bipolar) attitude measurement.
Supplementary Materials Supplementary materials are available online at https://academic.oup.com/jssam. Footnotes 1 In the following discussion, for simplicity we use the term “attitude,” but it should be understood that we are referring to both attitudes and beliefs. In the most general sense, our focus is on subjective variables. 2 Note that the MTMM design permits the separation of random and systematic measurement error components (see Alwin 2011, 2015). 3 The investigation of attrition and the role of non-random processes of attrition that affect variance components has shown that there is substantial support for the MCAR (Missing Completely at Random) assumption (see Alwin 2007, pp. 137–46). References Allison P. D. ( 1987), “Estimation of Linear Models with Incomplete Data,” in Sociological Methodology, 1987 , vol. 17, ed. Clogg C. C., pp. 71– 103, Washington DC: American Sociological Association. Alwin D. F. ( 1989), “ Problems in the Estimation and Interpretation of the Reliability of Survey Data,” Quality and Quantity , 23, 277– 331. Google Scholar CrossRef Search ADS Alwin D. F. ( 1991), “ Research on Survey Quality,” Sociological Methods and Research , 20, 3– 29. Google Scholar CrossRef Search ADS Alwin D. F. ( 1992), “Information Transmission in the Survey Interview: Number of Response Categories and the Reliability of Attitude Measurement,” in Sociological Methodology 1992 , ed. Marsden P. V., pp. 83– 118, Washington DC: American Sociological Association. Alwin D. F. ( 2007), Margins of Error: A Study of Reliability in Survey Measurement , New York: John Wiley & Sons. Alwin D. F. ( 2010), “How Good is Survey Measurement? Assessing the Reliability and Validity of Survey Measures,” in Handbook of Survey Research , eds. Marsden P.V., Wright J.D., pp. 405– 434, Bingley, UK: Emerald Group Publishing. Alwin D. F. 
( 2011), “Evaluating the Reliability and Validity of Survey Interview Data Using the MTMM Approach,” in Question Evaluation Methods—Contributing to the Science of Data Quality , eds. Madans J., Miller K., Maitland A., Willis G., pp. 265– 293, Hoboken, NJ: John Wiley & Sons, Inc. Alwin D. F. ( 2015), “New Approaches to Reliability and Validity Assessment,” on International Encyclopedia of the Social and Behavioral Sciences ( 2nd ed.), vol. 20, ed. Wright J. D., pp. 239– 247, Amsterdam, NL: Elsevier B.V. Ltd. Google Scholar CrossRef Search ADS Alwin D. F. ( 2016), “Survey Data Quality and Measurement Precision,” in SAGE Handbook of Survey Methodology , eds. Wolf C., Joye D., Smith T.W., Fu Y.-c., pp. 527–557, London: SAGE International Publishers Alwin D. F., Beattie B. A. ( 2016), “ The Kiss Principle in Survey Measurement—Question Length and Data Quality,” in Sociological Methodology , vol. 46 edited by D.F. Alwin, pp. 121–152, Thousand Oaks, CA: SAGE Publications. Alwin D. F., Krosnick J. A. ( 1991), “ The Reliability of Survey Attitude Measurement: The Influence of Question and Respondent Attributes,” Sociological Methods and Research , 20, 139– 181. Google Scholar CrossRef Search ADS Alwin D. F., Beattie B. A., Baumgartner E. M. ( 2015), “Assessing the Reliability of Measurement in the General Social Survey: The Content and Context of the GSS Survey Questions,” paper presented at the session on “Measurement Error and Questionnaire Design,” the 70th Annual Conference of the American Association for Public Opinion Research, Hollywood, FL, May. Alwin D. F., Zeiser K., Gensimore D. ( 2014), “ Reliability of Self-reports of Financial Data in Surveys: Results from the Health and Retirement Study,” Sociological Methods and Research , 43, 98– 136. Google Scholar CrossRef Search ADS Asparouhov T., Bength O., Muthén B. ( 2010), “Weighted Least Squares Estimation with Missing Data,” Mplus Technical Appendix. 
Retrieved from https://www.statmodel.com/download/GstrucMissingRevision.pdf. Accessed May 2, 2016. Bohrnstedt G. W., Mohler P. P., Müller W. ( 1987), “ An Empirical Study of the Reliability and Stability of Survey Research Items,” Sociological Methods and Research , 15, 171– 176. Google Scholar CrossRef Search ADS Brown R. L. ( 1989), “ Using Covariance Modeling for Estimating Reliability on Scales with Ordered Polytomous Variables,” Educational and Psychological Measurement , 49, 385– 398. Google Scholar CrossRef Search ADS Converse J. M., Presser S. ( 1986), Survey Questions: Handcrafting the Standardized Questionnaire , Beverly Hills, CA: Sage. Google Scholar CrossRef Search ADS Converse P. E. ( 1964), “The Nature of Belief Systems in the Mass Public,” in Ideology and Disconten , ed. Apter D. E., pp. 206– 261, New York: Free Press. General Social Survey ( 2011), “GSS Panel Data Release Notes,” Retrieved from: http://publicdata.norc.org:41000/gss/documents//OTHR/Release%20Notes%20for%20GSS%20Panel%2006W123%20R3.pdf. General Social Survey ( 2013), “GSS 2008 Sample Panel Wave 3, Release 1,” Retrieved from: http://publicdata.norc.org:41000/gss/documents//OTHR/Release%20Notes%20for%20GSS%20Panel%202008-sample.pdf. General Social Survey ( 2015), “Release Notes for GSS 2010-Sample Panel Wave 3, Release 1,” Retrieved from: http://gss.norc.org/documents/other/Release%20Notes%20for%20GSS%20Panel%202010W123%20R1.pdf. Heise D. R. ( 1969), “ Separating Reliability and Stability in Test-retest Correlation,” American Sociological Review , 34, 93– 191. Google Scholar CrossRef Search ADS Hout M., Hastings O. P.. ( 2016), “ Reliability of the Core Items in the General Social Survey: Estimates from the Three-Wave Panels, 2006-2014,” Sociological Science , 3, 971– 1002. Google Scholar CrossRef Search ADS Jabine T. B., Straf M. L., Tanur J. M., Tourangeau R. ( 1984), Cognitive Aspects of Survey Methodology: Building a Bridge Between Disciplines. 
Report of the Advanced Research Seminar on Cognitive Aspects of Survey Methodology , Washington DC: National Academy of Sciences Press. Jones L. V., Thurstone L. L. ( 1955), “ The Psychophysics of Semantics: An Experimental Investigation,” Journal of Applied Psychology , 39, 31– 36. Google Scholar CrossRef Search ADS Jöreskog K. G. ( 1970), “ Estimating and Testing of Simplex Models,” British Journal of Mathematical and Statistical Psychology , 23, 121– 145. Google Scholar CrossRef Search ADS Jöreskog K. G. ( 1990), “ New Developments in LISREL: Analysis of Ordinal Variables Using Polychoric Correlations and Weighted Least Squares,” Quality and Quantity , 24, 387– 404. Google Scholar CrossRef Search ADS Jöreskog K. G. ( 1994), “ On the Estimation of Polychoric Correlations and Their Asymptotic Covariance Matrix,” Psychometrika , 59, 381– 389. Google Scholar CrossRef Search ADS Kish L. ( 1982), “ Rensis Likert, 1903-1981,” The American Statistician , 32, 124– 125. Google Scholar CrossRef Search ADS Krosnick J. A., Alwin D. F. ( 1987), “ An Evaluation of a Cognitive Theory of Response-Order Effects in Survey Measurement,” Public Opinion Quarterly , 51, 201– 219. Google Scholar CrossRef Search ADS Krosnick J. A., Alwin D. F. ( 1988), “ A Test of the Form-Resistant Correlation Hypothesis: Ratings, Rankings, and the Measurement of Values,” Public Opinion Quarterly , 52, 526– 538. Google Scholar CrossRef Search ADS Krosnick J. A., Alwin D. F. ( 1989), “Response Strategies for Coping with the Cognitive Demands of Attitude Measures in Surveys,” GSS Methodological Report No. 46. General Social Survey, National Opinion Research Center, University of Chicago. [Also published as Krosnick, J. A. 1991. “Response Strategies for Coping with the Cognitive Demands of Attitude Measures in Surveys,” Applied Cognitive Psychology, 5,213–236.]. Krosnick J. A., Fabrigar L. R. ( 1997). 
“Designing Rating Scales for Effective Measurement in Surveys,” in Survey Measurement and Process Quality, eds. Lyberg L., Biemer P., Collins M., de Leeuw E., Dippo C., Schwarz N., Trewin D., pp. 141–164, New York: John Wiley and Sons.
Krosnick J. A., Presser S. (2010), “Question and Questionnaire Design,” in Handbook of Survey Research, eds. Marsden P. V., Wright J. D., pp. 263–313, Bingley, UK: Emerald Group Publishing.
Lee S.-Y., Poon W.-Y., Bentler P. M. (1990), “A Three-Stage Estimation Procedure for Structural Equation Models with Polytomous Variables,” Psychometrika, 55, 45–51.
Likert R. (1932), “A Technique for the Measurement of Attitudes,” Archives of Psychology, 140, 5–55.
Lord F. M., Novick M. R. (1968), Statistical Theories of Mental Test Scores, Reading, MA: Addison-Wesley.
Madans J., Miller K., Maitland A., Willis G. (eds.) (2011), Question Evaluation Methods: Contributing to the Science of Data Quality, Hoboken, NJ: Wiley.
Maitland A. (2009), “How Many Scale Points Should I Include for Attitudinal Questions?,” Survey Practice, 2(5), 1–4. Retrieved from http://surveypractice.org/index.php/SurveyPractice/article/view/179;html. Accessed August 14, 2017.
McKennell A. (1973), “Surveying Attitude Structures,” Quality and Quantity, 7, 203–296.
Moser C. A., Kalton G. (1972), Survey Methods of Social Investigation, New York: Basic Books.
Muthén B. O. (1984), “A General Structural Equation Model with Dichotomous, Ordered Categorical, and Continuous Latent Variable Indicators,” Psychometrika, 49, 115–132.
Rost J. (1988), “Measuring Attitudes with a Threshold Model Drawing on a Traditional Scaling Concept,” Applied Psychological Measurement, 12, 397–409.
Revilla M., Saris W. E., Krosnick J. A.
(2014), “Choosing the Number of Categories in Agree-Disagree Scales,” Sociological Methods & Research, 43, 73–97.
Saris W. E., Andrews F. M. (1991), “Evaluation of Measurement Instruments Using a Structural Modeling Approach,” in Measurement Errors in Surveys, eds. Biemer P. B., Groves R. M., Lyberg L. E., Mathiowetz N. A., Sudman S., pp. 575–597, New York: John Wiley & Sons.
Saris W. E., Gallhofer I. N. (2007), Design, Evaluation, and Analysis of Questionnaires for Survey Research, New York: John Wiley & Sons.
Schaeffer N. C., Dykema J. (2011), “Questions for Surveys: Current Trends and Future Directions,” Public Opinion Quarterly, 75, 909–961.
Schaeffer N. C., Presser S. (2003), “The Science of Asking Questions,” Annual Review of Sociology, 29, 65–88.
Schuessler K. (1971), Analyzing Social Data: A Statistical Orientation, New York: Houghton-Mifflin.
Schuman H., Presser S. (1981), Questions and Answers: Experiments in Question Wording, Form and Context, New York: Academic Press.
Sirken M. G., Herrmann D. J., Schechter S., Schwarz N., Tanur J. M., Tourangeau R. (1999), Cognition and Survey Research, New York: John Wiley and Sons.
Smith T. W., Marsden P. V., Hout M. (2015), General Social Surveys, 1972–2014 [machine-readable data file]. Principal Investigator, Tom W. Smith; Co-Principal Investigators, Peter V. Marsden and Michael Hout, NORC ed. Chicago: National Opinion Research Center. 1 data file (59,599 logical records) and 1 codebook (3,485 pp.).
Sturgis P., Roberts C., Smith P. (2014), “Middle Alternatives Revisited: How the Neither/Nor Response Acts as a Way of Saying ‘I Don’t Know’,” Sociological Methods and Research, 43, 15–38.
Sudman S., Bradburn N. M. (1974), Response Effects in Surveys, Chicago, IL: Aldine.
Sudman S., Bradburn N. M.
(1982), Asking Questions: A Practical Guide to Questionnaire Design, San Francisco, CA: Jossey-Bass.
Sudman S., Bradburn N. M., Schwarz N. (1996), Thinking About Answers: The Application of Cognitive Processes to Survey Methodology, San Francisco, CA: Jossey-Bass.
Tanur J. M. (ed.) (1992), Questions about Questions: Inquiries into the Cognitive Bases of Surveys, New York: Russell Sage Foundation.
Thurstone L. L. (1927), “A Law of Comparative Judgment,” Psychological Review, 34, 273–286. [See Psychological Review, 1994, 101, 266–270.]
Tourangeau R., Rips L. J., Rasinski K. (2000), The Psychology of Survey Response, Cambridge: Cambridge University Press.
Turner C. F., Martin E. (1984), Surveying Subjective Phenomena, Vol. 1, New York, NY: Russell Sage.
Uebersax J. S. (2006), “Likert Scales: Dispelling the Confusion,” Statistical Methods for Rater Agreement, Available at http://ourworld.compuserve.com/hompages/jsuebersax/likert2.htm (accessed February 20, 2008).
Wiggins L. M. (1973), Panel Analysis: Latent Probability Models for Attitude and Behavior Processes, New York: Elsevier Scientific Publishing Company.
Wiley D. E., Wiley J. A. (1970), “The Estimation of Measurement Error in Panel Data,” American Sociological Review, 35, 112–117.
Wittenbrink B., Schwarz N. (eds.) (2007), Implicit Measures of Attitudes, New York, NY: Guilford Press.
Wothke W. (2000), “Longitudinal and Multigroup Modeling with Missing Data,” in Modeling Longitudinal and Multilevel Data: Practical Issues, Applied Approaches and Specific Examples, eds. Little T. D., Schnabel K. U., Baumert J., pp. 219–240, Mahwah, NJ: Lawrence Erlbaum Associates, Publishers.
© The Author 2017. Published by Oxford University Press on behalf of the American Association for Public Opinion Research. All rights reserved.
For permissions, please email: firstname.lastname@example.org
This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices).
Journal of Survey Statistics and Methodology – Oxford University Press
Published: Sep 23, 2017