Best Practices for Developing and Validating Scales for Health, Social, and Behavioral Research: A Primer

REVIEW published: 11 June 2018. doi: 10.3389/fpubh.2018.00149

Godfred O. Boateng 1*, Torsten B. Neilands 2, Edward A. Frongillo 3, Hugo R. Melgar-Quiñonez 4 and Sera L. Young 1,5

1 Department of Anthropology and Global Health, Northwestern University, Evanston, IL, United States; 2 Division of Prevention Science, Department of Medicine, University of California, San Francisco, San Francisco, CA, United States; 3 Department of Health Promotion, Education and Behavior, Arnold School of Public Health, University of South Carolina, Columbia, SC, United States; 4 Institute for Global Food Security, School of Human Nutrition, McGill University, Montreal, QC, Canada; 5 Institute for Policy Research, Northwestern University, Evanston, IL, United States

Edited by: Jimmy Thomas Efird, University of Newcastle, Australia. Reviewed by: Aida Turrini, Consiglio per la Ricerca in Agricoltura e l'Analisi dell'Economia Agraria (CREA), Italy; Mary Evelyn Northridge, New York University, United States. Correspondence: Godfred O. Boateng, [email protected]. Specialty section: this article was submitted to Epidemiology, a section of the journal Frontiers in Public Health.

Scale development and validation are critical to much of the work in the health, social, and behavioral sciences. However, the constellation of techniques required for scale development and evaluation can be onerous, jargon-filled, unfamiliar, and resource-intensive. Further, it is often not a part of graduate training. Therefore, our goal was to concisely review the process of scale development in as straightforward a manner as possible, both to facilitate the development of new, valid, and reliable scales, and to help improve existing ones. To do this, we have created a primer for best practices for scale development in measuring complex phenomena. This is not a systematic review, but rather the amalgamation of technical literature and lessons learned from our experiences spent creating or adapting a number of scales over the past several decades. We identified three phases that span nine steps. In the first phase, items are generated and the validity of their content is assessed. In the second phase, the scale is constructed. Steps in scale construction include pre-testing the questions, administering the survey, reducing the number of items, and understanding how many factors the scale captures. In the third phase, scale evaluation, the number of dimensions is tested, reliability is tested, and validity is assessed. We have also added examples of best practices to each step. In sum, this primer will equip both scientists and practitioners to understand the ontology and methodology of scale development and validation, thereby facilitating the advancement of our understanding of a range of health, social, and behavioral outcomes.
Received: 26 February 2018; Accepted: 02 May 2018; Published: 11 June 2018

Keywords: scale development, psychometric evaluation, content validity, item reduction, factor analysis, tests of dimensionality, tests of reliability, tests of validity

Citation: Boateng GO, Neilands TB, Frongillo EA, Melgar-Quiñonez HR and Young SL (2018) Best Practices for Developing and Validating Scales for Health, Social, and Behavioral Research: A Primer. Front. Public Health 6:149. doi: 10.3389/fpubh.2018.00149

INTRODUCTION

Scales are a manifestation of latent constructs; they measure behaviors, attitudes, and hypothetical scenarios we expect to exist as a result of our theoretical understanding of the world, but cannot assess directly (1). Scales are typically used to capture a behavior, a feeling, or an action that cannot be captured in a single variable or item. The use of multiple items to measure an underlying latent construct can additionally account for, and isolate, item-specific measurement error, which leads to more accurate research findings. Thousands of scales have been developed that can measure a range of social, psychological, and health behaviors and experiences. As science advances and novel research questions are put forth, new scales become necessary.

Scale development is not, however, an obvious or a straightforward endeavor. There are many steps to scale development, there is significant jargon within these techniques, the work can be costly and time consuming, and complex statistical analysis is often required. Further, many health and behavioral science degrees do not include training on scale development. Despite the availability of a large amount of technical literature on scale theory and development (1-7), there are a number of incomplete scales used to measure mental, physical, and behavioral attributes that are fundamental to our scientific inquiry (8, 9).

Therefore, our goal is to describe the process of scale development in as straightforward a manner as possible, both to facilitate the development of new, valid, and reliable scales, and to help improve existing ones. To do this, we have created a primer for best practices for scale development. We anticipate this primer will be broadly applicable across many disciplines, especially the health, social, and behavioral sciences. This is not a systematic review, but rather the amalgamation of technical literature and lessons learned from our experiences spent creating or adapting a number of scales related to multiple disciplines (10-23). First, we provide an overview of each of the nine steps. Then, within each step, we define key concepts, describe the tasks required to achieve that step, share common pitfalls, and draw on examples in the health, social, and behavioral sciences to recommend best practices. We have tried to keep the material as straightforward as possible; references to the body of technical work have been the foundation of this primer.

SCALE DEVELOPMENT OVERVIEW

There are three phases to creating a rigorous scale: item development, scale development, and scale evaluation (24); these can be further broken down into nine steps (Figure 1). Item development, i.e., coming up with the initial set of questions for an eventual scale, is composed of (1) identification of the domain(s) and item generation, and (2) consideration of content validity.
The second phase, scale development, i.e., turning individual items into a harmonious and measuring construct, consists of (3) pre-testing questions, (4) sampling and survey administration, (5) item reduction, and (6) extraction of latent factors. The last phase, scale evaluation, requires (7) tests of dimensionality, (8) tests of reliability, and (9) tests of validity.

FIGURE 1 | An overview of the three phases and nine steps of scale development and validation. [Figure not reproduced here.]

Abbreviations: A-CASI, audio computer self-assisted interviewing; ASES, adherence self-efficacy scale; CAPI, computer assisted personal interviewing; CASIC, computer assisted survey information collection builder; CFA, confirmatory factor analysis; CFI, comparative fit index; CTT, classical test theory; DIF, differential item functioning; EFA, exploratory factor analysis; FIML, full information maximum likelihood; FNE, fear of negative evaluation; G, global factor; ICC, intraclass correlation coefficient; ICM, independent cluster model; IRT, item response theory; ODK, Open Data Kit; PAPI, paper and pen/pencil interviewing; QDS, Questionnaire Development System; RMSEA, root mean square error of approximation; SAD, social avoidance and distress; SAS, statistical analysis systems; SASC-R, social anxiety scale for children revised; SEM, structural equation model; SPSS, statistical package for the social sciences; SRMR, standardized root mean square residual; Stata, statistics and data; TLI, Tucker Lewis Index; WASH, water, sanitation, and hygiene; WRMR, weighted root mean square residual.
TABLE 1 | The three phases and nine steps of scale development and validation.

PHASE 1: ITEM DEVELOPMENT

Step 1: Identification of Domain and Item Generation: Selecting Which Items to Ask
- Domain identification. Purpose: to specify the boundaries of the domain and facilitate item generation. How to explore or estimate: 1.1 Specify the purpose of the domain; 1.2 Confirm that there are no existing instruments; 1.3 Describe the domain and provide a preliminary conceptual definition; 1.4 Specify the dimensions of the domain if they exist a priori; 1.5 Define each dimension. References: (1-4), (25).
- Item generation. Purpose: to identify appropriate questions that fit the identified domain. How: 1.6 Deductive methods: literature review and assessment of existing scales; 1.7 Inductive methods: exploratory research methodologies, including focus group discussions and interviews. References: (2-5), (24-41).

Step 2: Content Validity: Assessing if the Items Adequately Measure the Domain of Interest
- Evaluation by experts. Purpose: to evaluate each of the items constituting the domain for content relevance, representativeness, and technical quality. How: 2.1 Quantify the assessments of 5-7 expert judges using formalized scaling and statistical procedures, including the content validity ratio, the content validity index, or Cohen's kappa coefficient; 2.2 Conduct the Delphi method with expert judges. References: (1-5), (24, 42-48).
- Evaluation by target population. Purpose: to evaluate each item constituting the domain for representativeness of the actual experience of the target population. How: 2.3 Conduct cognitive interviews with end users of the scale items to evaluate face validity. References: (20, 25).

PHASE 2: SCALE DEVELOPMENT

Step 3: Pre-testing Questions: Ensuring the Questions and Answers Are Meaningful
- Cognitive interviews. Purpose: to assess the extent to which questions reflect the domain of interest and answers produce valid measurements. How: 3.1 Administer draft questions to 5-15 interviewees in 2-3 rounds while allowing respondents to verbalize the mental process entailed in providing answers. References: (49-54).

Step 4: Survey Administration and Sample Size: Gathering Enough Data from the Right People
- Survey administration. Purpose: to collect data with minimum measurement errors. How: 4.1 Administer potential scale items to a sample that reflects the range of the target population, using paper or a device. References: (55-58).
- Establishing the sample size. Purpose: to ensure the availability of sufficient data for scale development. How: 4.2 Recommended sample size is 10 respondents per survey item and/or 200-300 observations. References: (29, 59-65).
- Determining the type of data to use. Purpose: to ensure the availability of data for scale development and validation. How: 4.3 Use cross-sectional data for exploratory factor analysis; 4.4 Use data from a second time point (at least 3 months later in a longitudinal dataset) or an independent sample for the test of dimensionality (Step 7).

Step 5: Item Reduction: Ensuring Your Scale Is Parsimonious
- Item difficulty index. Purpose: to determine the proportion of correct answers given per item (CTT), or the probability of a particular examinee correctly answering a given item (IRT). How: 5.1 The proportion can be calculated for CTT and the item difficulty parameter estimated for IRT using statistical packages. References: (1, 2, 66-68).
- Item discrimination test. Purpose: to determine the degree to which an item or set of test questions measures a unitary attribute (CTT), or how steeply the probability of a correct response changes as ability increases (IRT). How: 5.2 Estimate biserial correlations or the item discrimination parameter using statistical packages. References: (69-75).
- Inter-item and item-total correlations. Purpose: to determine the correlations between scale items, as well as the correlations between each item and the sum score of the scale items. How: 5.3 Estimate inter-item correlations/item communalities, item-total, and adjusted item-total correlations using statistical packages. References: (1, 2, 68, 76).
- Distractor efficiency analysis. Purpose: to determine the distribution of incorrect options and how they contribute to the quality of items. How: 5.4 Estimate distractor analysis using statistical packages. References: (77-80).
- Deleting or imputing missing cases. Purpose: to ensure the availability of complete cases for scale development. How: 5.5 Delete items with many cases that are permanently missing, or use multiple imputation or full information maximum likelihood to impute the data. References: (81-84).

Step 6: Extraction of Factors: Exploring the Number of Latent Constructs that Fit Your Observed Data
- Factor analysis. Purpose: to determine the optimal number of factors or domains that fit a set of items. How: 6.1 Use scree plots, exploratory factor analysis, parallel analysis, the minimum average partial procedure, and/or the Hull method. References: (2-4), (85-90).

PHASE 3: SCALE EVALUATION

Step 7: Tests of Dimensionality: Testing if Latent Constructs Are as Hypothesized
- Test dimensionality. Purpose: to address queries on the latent structure of scale items and their underlying relationships, i.e., to validate whether the previously hypothesized structure fits the items. How: 7.1 Estimate an independent cluster model confirmatory factor analysis (cf. Table 2); 7.2 Estimate bifactor models to eliminate ambiguity about the type of dimensionality (unidimensionality, bidimensionality, or multidimensionality); 7.3 Estimate measurement invariance to determine whether the hypothesized factors and dimensions are congruent across groups or multiple samples. References: (91-114).
- Score scale items. Purpose: to create scale scores for substantive analysis, including reliability and validity of the scale. How: 7.4 Calculate scale scores using an unweighted approach, which includes summing standardized item scores or raw item scores, or computing the mean of raw item scores; 7.5 Calculate scale scores using a weighted approach, which includes creating factor scores via confirmatory factor analysis or structural equation models. References: (115).

Step 8: Tests of Reliability: Establishing if Responses Are Consistent When Repeated
- Calculate reliability statistics. Purpose: to assess the internal consistency of the scale, i.e., the degree to which the set of items in the scale co-vary relative to their sum score. How: 8.1 Estimate using Cronbach's alpha; 8.2 Other tests, such as Raykov's rho, ordinal alpha, and Revelle's beta, can be used to assess scale reliability. References: (116-123).
- Test-retest reliability. Purpose: to assess the degree to which the participant's performance is repeatable, i.e., how consistent their scores are across time. How: 8.3 Estimate the strength of the relationship between scale items over two or three time points; a variety of measures is possible. References: (1, 2, 124, 125).

Step 9: Tests of Validity: Ensuring You Measure the Latent Dimension You Intended
Criterion validity:
- Predictive validity. Purpose: to determine if scores predict future outcomes. How: 9.1 Use bivariate and multivariable regression; stronger and significant associations or causal effects suggest greater predictive validity. References: (1, 2, 31).
- Concurrent validity. Purpose: to determine the extent to which scale scores have a stronger relationship with criterion measurements made near the time of administration. How: 9.2 Estimate the association between scale scores and the "gold standard" of scale measurement; a stronger significant association in the Pearson product-moment correlation suggests support for concurrent validity. References: (2).
Construct validity:
- Convergent validity. Purpose: to examine if the same concept measured in different ways yields similar results. How: 9.3 Estimate the relationship between scale scores and similar constructs using a multi-trait multi-method matrix, latent variable modeling, or the Pearson product-moment coefficient; higher/stronger correlation coefficients suggest support for convergent validity. References: (2, 37, 126).
- Discriminant validity. Purpose: to examine if the concept measured is different from some other concept. How: 9.4 Estimate the relationship between scale scores and distinct constructs using a multi-trait multi-method matrix, latent variable modeling, or the Pearson product-moment coefficient; lower/weaker correlation coefficients suggest support for discriminant validity. References: (2, 37, 126).
- Differentiation by "known groups". Purpose: to examine if the concept measured behaves as expected in relation to "known groups". How: 9.5 Select known binary variables based on theoretical and empirical knowledge and determine the distribution of the scale scores over the known groups; use t-tests if binary, ANOVA if multiple groups. References: (2, 126).
- Correlation analysis. Purpose: to determine the relationship between existing measures or variables and newly developed scale scores. How: 9.6 Correlate scale scores and existing measures or, preferably, use linear regression, the intraclass correlation coefficient, and analysis of the standard deviations of the differences between scores. References: (2, 127, 128).

PHASE 1: ITEM DEVELOPMENT

Step 1: Identification of the Domain(s) and Item Generation

Domain Identification
The first step is to articulate the domain(s) that you are endeavoring to measure. A domain or construct refers to the concept, attribute, or unobserved behavior that is the target of the study (25).
Therefore, the domain being examined should be decided upon and defined before any item activity (2). A well-defined domain will provide a working knowledge of the phenomenon under study, specify the boundaries of the domain, and ease the process of item generation and content validation.
McCoach et al. outline a number of steps in scale development; we find the first five to be suitable for the identification of the domain (4). These are all based on thorough literature review and include (a) specifying the purpose of the domain or construct you seek to develop and (b) confirming that there are no existing instruments that will adequately serve the same purpose. Where a similar instrument exists, you need to justify why the development of a new instrument is appropriate and how it will differ from existing instruments. Then, (c) describe the domain and provide a preliminary conceptual definition and (d) specify, if any, the dimensions of the domain. Alternatively, you can let the number of dimensions forming the domain be determined through statistical computation (cf. Steps 5, 6, and 7). Domains are determined a priori if there is an established framework or theory guiding the study, but a posteriori if none exists. Finally, if domains are identified a priori, (e) the final conceptual definition for each domain should be specified.

Item Generation
Once the domain is delineated, the item pool can then be identified. This process is also called "question development" (26) or "item generation" (24). There are two ways to identify appropriate questions: deductive and inductive methods (24).
The deductive method, also known as "logical partitioning" or "classification from above" (27), is based on the description of the relevant domain and the identification of items. This can be done through literature review and assessment of existing scales and indicators of that domain (2, 24). The inductive method, also known as "grouping" or "classification from below" (24, 27), involves the generation of items from the responses of individuals (24). Qualitative data obtained through direct observations and exploratory research methodologies, such as focus groups and individual interviews, can be used to inductively identify domain items (5).
It is considered best practice to combine both deductive and inductive methods to both define the domain and identify the questions to assess it. While the literature review provides the theoretical basis for defining the domain, the use of qualitative techniques moves the domain from an abstract point to the identification of its manifest forms. A scale or construct defined by theoretical underpinnings is better placed to make specific pragmatic decisions about the domain (28), as the construct will be based on accumulated knowledge of existing items.
It is recommended that the items identified using deductive and inductive approaches be broader and more comprehensive than one's own theoretical view of the target (28, 29). Further, content should be included that ultimately will be shown to be tangential or unrelated to the core construct. In other words, one should not hesitate to have items on the scale that do not perfectly fit the domain identified, as successive evaluation will eliminate undesirable items from the initial pool.
Kline and Schinka et al. note that the initial pool of items developed should be at minimum twice as large as the desired final scale (26, 30). Others have recommended that the initial pool be five times as large as the final version, to provide the requisite margin to select an optimum combination of items (30). We agree with Kline and Schinka et al. (26, 30) that the initial pool should be at least twice as large as the desired final scale.
Further, in the development of items, the form of the items, the wording of the items, and the types of responses that the question is designed to induce should be taken into account. This also means questions should capture the lived experience of the phenomenon by the target population (30). Further, items should be worded simply and unambiguously. Items should not be offensive or potentially biased in terms of social identity, i.e., gender, religion, ethnicity, race, economic status, or sexual orientation (30).
Fowler identified five essential characteristics of items required to ensure the quality of construct measurement (31). These include (a) the need for items to be consistently understood; (b) the need for items to be consistently administered or communicated to respondents; (c) the consistent communication of what constitutes an adequate answer; (d) the need for all respondents to have access to the information needed to answer the question accurately; and (e) the willingness of respondents to provide the correct answers required by the question at all times.
These essentials are sometimes very difficult to achieve. Krosnick (32) suggests that respondents can be less thoughtful about the meaning of a question, search their memories less comprehensively, integrate retrieved information less carefully, or even select a less precise response choice. All this means that they are merely satisficing, i.e., providing merely satisfactory answers rather than the most accurate ones. In order to combat this behavior, questions should be kept simple and straightforward and should follow the conventions of normal conversation.
With regard to the type of responses to these questions, we recommend that questions with dichotomous response categories (e.g., true/false) have no ambiguity. When a Likert-type response scale is used, the points on the scale should reflect the entire measurement continuum. Responses should be presented in an ordinal manner, i.e., in ascending order without any overlap, and each point on the response scale should be meaningful and interpreted the same way by each participant to ensure data quality (33).
In terms of the number of points on the response scale, Krosnick and Presser (33) showed that responses with just two to three points have lower reliability than Likert-type response scales with five to seven points; the gain, however, levels off after seven points. Therefore, response scales with five points are recommended for unipolar items, i.e., those reflecting relative degrees of a single item response quality (e.g., not at all satisfied to very satisfied). Seven response points are recommended for bipolar items, i.e., those reflecting relative degrees of two qualities of an item response scale (e.g., completely dissatisfied to completely satisfied). As an analytic aside, items with fewer than five scale points are best estimated using robust categorical methods, whereas items with five to seven categories without strong floor or ceiling effects can be treated as continuous items in confirmatory factor analysis and structural equation modeling using maximum likelihood estimation (34).
One pitfall in the identification of the domain and item generation is the improper conceptualization and definition of the domain(s). This can result in scales that are deficient because the definition of the domain is ambiguous or has been inadequately specified (35). It can also result in contamination, i.e., the definition of the domain overlaps with other existing constructs in the same field (35). Caution should also be taken to avoid construct underrepresentation, which is when a scale does not capture important aspects of a construct because its focus is too narrow (35, 36). Further, construct-irrelevant variance, the degree to which test scores are influenced by processes that have little to do with the intended construct and seem widely inclusive of non-related items (36, 37), should be avoided. Both construct underrepresentation and construct-irrelevant variance can lead to the invalidation of the scale (36).
An example of best practice using the deductive approach to item generation is found in the work of Dennis on breastfeeding self-efficacy (38-40). Dennis' breastfeeding self-efficacy scale items were first informed by Bandura's theory on self-efficacy, followed by content analysis of the literature and empirical studies on breastfeeding-related confidence.
A valuable example of a rigorous inductive approach is found in the work of Frongillo and Nanama on the development and validation of an experience-based measure of household food insecurity in northern Burkina Faso (41). In order to generate items for the measure, they undertook in-depth interviews with 10 household heads and 26 women using interview guides. The data from these interviews were thematically analyzed, with the results informing the identification of items to be added to or deleted from the initial questionnaire. The interviews also led to the development and revision of answer choices.
Step 2: Content Validity
Content validity, also known as "theoretical analysis" (5), refers to the "adequacy with which a measure assesses the domain of interest" (24). The need for content adequacy is vital if the items are to measure what they are presumed to measure (1). Additionally, content validity specifies content relevance and content representation, i.e., that the items capture the relevant experience of the target population being examined (129).
Content validity entails ensuring that the scale captures only the phenomenon spelled out in the conceptual definition, and that other aspects that "might be related but are outside the investigator's intent for that particular [construct]" are not added (1). Guion has proposed five conditions that must be satisfied in order for one to claim any form of content validity, and we find these conditions to be broadly applicable to scale development in any discipline. These include that (a) the behavioral content has a generally accepted meaning or definition; (b) the domain is unambiguously defined; (c) the content domain is relevant to the purposes of measurement; (d) qualified judges agree that the domain has been adequately sampled based on consensus; and (e) the response content must be reliably observed and evaluated (42). Therefore, content validity requires evidence of content relevance, representativeness, and technical quality.
Content validity is mainly assessed through evaluation by expert and target population judges.
Evaluation by Experts
Expert judges are highly knowledgeable about the domain of interest and/or scale development; target population judges are potential users of the scale (1, 5). Expert judges seem to be used more often than target population judges in scale development work to date. Ideally, one should combine expert and target population judgment; when resources are constrained, however, we recommend at least the use of expert judges.
Expert judges evaluate each of the items to determine whether they represent the domain of interest. These expert judges should be independent of those who developed the item pool. Expert judgment can be done systematically to avoid bias in the assessment of items. Multiple judges have been used, typically ranging from 5 to 7 (25). Their assessments have been quantified using formalized scaling and statistical procedures such as the content validity ratio for quantifying consensus (43), the content validity index for measuring proportional agreement (44), or Cohen's coefficient kappa (k) for measuring inter-rater or expert agreement (45). Among the three procedures, we recommend Cohen's coefficient kappa, which has been found to be most efficient (46). Additionally, an increase in the number of experts has been found to increase the robustness of the ratings (25, 44).
Another way by which content validity can be assessed through expert judges is by using the Delphi method to come to a consensus on which questions are a reflection of the construct you want to measure. The Delphi method is a technique "for structuring group communication process so that the process is effective in allowing a group of individuals, as a whole, to deal with a complex problem" (47).
A good example of evaluation of content validity using expert judges is seen in the work of Augustine et al. on adolescent knowledge of micronutrients (48). After identifying a list of items to be validated, the authors consulted experts in the fields of nutrition, psychology, medicine, and basic sciences. The items were then subjected to content analysis using expert judges. Two independent reviews were carried out by a panel of five experts to select the questions that were appropriate, accurate, and interpretable. Items were either accepted, rejected, or modified based on majority opinion (48).

Evaluation by Target Population
Target population judges are experts at evaluating face validity, which is a component of content validity (25). Face validity is the "degree that respondents or end users [or lay persons] judge that the items of an assessment instrument are appropriate to the targeted construct and assessment objectives" (25). These end users are able to tell whether the construct is a good measure of the domain through cognitive interviews, which we discuss in Step 3.
An example of the concurrent use of expert and target population judges comes from Boateng et al.'s work to develop a household-level water insecurity scale appropriate for use in western Kenya (20). We used the Delphi method to obtain three rounds of feedback from international experts, including those in hydrology, geography, WASH and water-related programs, policy implementation, and food insecurity. Each of the three rounds was interspersed with focus group discussions with our target population, i.e., people living in western Kenya. In each round, the questionnaires progressively became more closed ended, until consensus was attained on the definition of the domain we were studying and possible items we could use.
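To make the expert-judgment statistics described in this step concrete, here is a minimal sketch (not from the original article) of Lawshe's content validity ratio and Cohen's kappa for a hypothetical panel. The `ratings` matrix, with candidate items as rows and five expert judges as columns (1 = relevant/essential, 0 = not), is invented for illustration.

```python
# Minimal sketch: quantifying expert agreement on candidate items.
# The `ratings` matrix is hypothetical (rows = items, columns = expert judges).
import numpy as np

ratings = np.array([
    [1, 1, 1, 1, 0],   # item 1, five expert judges
    [0, 0, 1, 0, 0],   # item 2
    [1, 1, 1, 1, 1],   # item 3
])

def content_validity_ratio(item_ratings):
    """Lawshe's CVR = (n_e - N/2) / (N/2), where n_e is the number of experts
    calling the item essential and N is the total number of experts."""
    n_experts = len(item_ratings)
    n_essential = int(np.sum(item_ratings))
    return (n_essential - n_experts / 2) / (n_experts / 2)

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over binary judgments:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    p_observed = np.mean(a == b)
    p_chance = np.mean(a) * np.mean(b) + np.mean(1 - a) * np.mean(1 - b)
    return (p_observed - p_chance) / (1 - p_chance)

for i, item in enumerate(ratings, start=1):
    print(f"Item {i}: CVR = {content_validity_ratio(item):+.2f}")

# Agreement between the first two judges across all items
print("Kappa (judge 1 vs judge 2):", round(cohens_kappa(ratings[:, 0], ratings[:, 1]), 2))
```

Items with low or negative CVR values, or item sets on which judges show poor agreement, would be candidates for revision or deletion before pre-testing.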
PHASE 2: SCALE DEVELOPMENT

Step 3: Pre-testing Questions
Pre-testing helps to ensure that items are meaningful to the target population before the survey is actually administered, i.e., it minimizes misunderstanding and subsequent measurement error. Because pre-testing eliminates poorly worded items and facilitates revision of phrasing to be maximally understood, it also serves to reduce the cognitive burden on research participants. Finally, pre-testing represents an additional way in which members of the target population can participate in the research process by contributing their insights to the development of the survey.
Pre-testing has two components: the first is the examination of the extent to which the questions reflect the domain being studied. The second is the examination of the extent to which answers to the questions asked produce valid measurements (31).

Cognitive Interviews
To evaluate whether the questions reflect the domain of study and meet the requisite standards, techniques including cognitive interviews, focus group discussions, and field pre-testing under realistic conditions can be used. We describe the most recommended, which is cognitive interviews.
Cognitive interviewing entails the administration of draft survey questions to target populations and then asking the respondents to verbalize the mental process entailed in providing such answers (49). Generally, cognitive interviews allow for questions to be modified, clarified, or augmented to fit the objectives of the study. This approach helps to determine whether the question is generating the information that the author intends, by helping to ensure that respondents understand questions as developers intended and that respondents are able to answer in a manner that reflects their experience (49, 50). This can be done on a sample outside of the study population or on a subset of study participants, but it must be explored before the questionnaire is finalized (51, 52).
The sample used for cognitive interviewing should capture the range of demographics you anticipate surveying (49). A range of 5-15 interviews in two to three rounds, or until saturation, i.e., relatively few new insights emerge, is considered ideal for pre-testing (49, 51, 52).
In sum, cognitive interviews get to the heart of both assessing the appropriateness of the question to the target population and the strength of the responses (49). The advantages of using cognitive interviewing include: (a) it ensures questions are producing the intended data, (b) questions that are confusing to participants are identified and improved for clarity, (c) problematic questions or questions that are difficult to answer are identified, (d) it ensures response options are appropriate and adequate, (e) it reveals the thought process of participants on domain items, and (f) it can indicate problematic question order (52, 53). Outcomes of cognitive interviews should always be reported, along with the solutions used to remedy the situation.
An example of best practice in pre-testing is seen in the work of Morris et al. (54). They developed and validated a novel scale for measuring interpersonal factors underlying injection drug use behaviors among injecting partners. After item development and expert judgment, they conducted cognitive interviews with seven respondents with similar characteristics to the target population to refine and assess item interpretation and to finalize item structure. Eight items were dropped after cognitive interviews for lack of clarity or importance. They also made modifications to grammar, word choice, and answer options based on the feedback from cognitive interviews.
Step 4: Survey Administration and Sample Size

Survey Administration
Collecting data with minimum measurement errors from an adequate sample size is imperative. These data can be collected using paper and pen/pencil interviewing (PAPI) or computer assisted personal interviewing (CAPI) on devices like laptops, tablets, or phones. A number of software programs exist for building forms on devices. These include Computer Assisted Survey Information Collection (CASIC) Builder (West Portal Software Corporation, San Francisco, CA); Qualtrics Research Core (www.qualtrics.com); Open Data Kit (ODK, https://opendatakit.org/); Research Electronic Data Capture (REDCap) (55); SurveyCTO (Dobility, Inc., https://www.surveycto.com); and Questionnaire Development System (QDS, www.novaresearch.com), which allows the survey participant to report sensitive audio data.
Each approach has advantages and drawbacks. Using technology can reduce the errors associated with data entry, allow the collection of data from large samples with minimal cost, increase response rates, reduce enumerator errors, permit instant feedback, and increase monitoring of data collection and the ability to get more confidential data (56-58, 130). A subset of technology-based programs offers the option of attaching audio files to the survey questions so that questions may be recorded and read out loud to participants with low literacy via audio computer self-assisted interviewing (A-CASI) (131). Self-interviewing, whether via A-CASI or via computer-assisted personal interviewing, in which participants read and respond to questions on a computer without interviewer involvement, may increase reports of sensitive or stigmatized behaviors such as sexual behaviors and substance use, compared to when being asked by another human.
On the other hand, paper forms may avert the crisis of losing data if the software crashes or the devices are lost or stolen prior to being backed up, and may be more suitable in areas that have irregular electricity and/or internet. However, as sample sizes increase, the use of PAPI becomes more expensive, time and labor intensive, and the data are exposed in several ways to human error (57, 58). Based on the merits of CAPI over PAPI, we recommend researchers use CAPI in data collection for surveys when feasible.
Establishing the Sample Size
The sample size needed for the development of a latent construct has often been contentious. It is recommended that potential scale items be tested on a heterogeneous sample, i.e., a sample that both reflects and captures the range of the target population (29). For example, when the scale is to be used in a clinical setting, Clark and Watson recommend using patient samples early on instead of a sample from the general population (29).
The necessary sample size is dependent on several aspects of any given study, including the level of variation between the variables and the level of over-determination (i.e., the ratio of variables to number of factors) of the factors (59). The rule of thumb has been at least 10 participants for each scale item, i.e., an ideal ratio of respondents to items of 10:1 (60). However, others have suggested sample sizes that are independent of the number of survey items. Clark and Watson (29) propose using 300 respondents after initial pre-testing. Others have recommended a range of 200-300 as appropriate for factor analysis (61, 62). Based on their simulation study using different sample sizes, Guadagnoli and Velicer (61) suggested that a minimum of 300-450 is required to observe an acceptable comparability of patterns, and that replication is required if the sample size is <300. Comrey and Lee suggest a graded scale of sample sizes for scale development: 100 = poor, 200 = fair, 300 = good, 500 = very good, >=1,000 = excellent (63). Additionally, item reduction procedures (described below in Step 5), such as parallel analysis, which requires bootstrapping (estimating statistical parameters from a sample by means of resampling with replacement) (64), may require larger data sets.
In sum, there is no single respondent-to-item ratio that works for all survey development scenarios; a larger sample size or respondent:item ratio is always better. A larger sample size implies lower measurement errors, more stable factor loadings, replicable factors, and results that generalize to the true population structure (59, 65). A smaller sample size or respondent:item ratio may mean more unstable loadings and factors; random, non-replicable factors; and non-generalizable results (59, 65). Sample size is, however, always constrained by the resources available, and more often than not, scale development can be difficult to fund.
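As a rough planning aid, the heuristics above can be combined in a few lines of code. The sketch below is illustrative only; `n_items` and `planned_n` are hypothetical values, and the thresholds are simply the rules of thumb cited in this step, not a substitute for judgment.

```python
# Illustrative sample-size check for a draft item pool, using the heuristics
# cited in Step 4: a 10:1 respondent-to-item ratio, a ~300-observation floor
# for factor analysis, and Comrey and Lee's graded benchmarks.
def sample_size_guidance(n_items, planned_n):
    ratio_rule = 10 * n_items          # 10 respondents per item
    floor_rule = 300                   # Clark & Watson / 200-300 range
    recommended = max(ratio_rule, floor_rule)

    comrey_lee = [(1000, "excellent"), (500, "very good"), (300, "good"),
                  (200, "fair"), (100, "poor")]
    grade = next((label for cutoff, label in comrey_lee if planned_n >= cutoff),
                 "below all published benchmarks")

    return {"minimum_suggested_n": recommended,
            "planned_n": planned_n,
            "comrey_lee_grade": grade,
            "respondents_per_item": round(planned_n / n_items, 1)}

print(sample_size_guidance(n_items=34, planned_n=350))
# {'minimum_suggested_n': 340, 'planned_n': 350, 'comrey_lee_grade': 'good',
#  'respondents_per_item': 10.3}
```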
Determining the Type of Data to Use
The development of a scale minimally requires data from a single point in time. To fully test the reliability of the scale (cf. Steps 8, 9), however, either an independent dataset or a subsequent time point is necessary. Data from longitudinal studies can be used for initial scale development (e.g., from baseline), to conduct confirmatory factor analysis (using follow-up data, cf. Step 7), and to assess test-retest reliability (using baseline and follow-up data). The problem with using longitudinal data to test hypothesized latent structures is common error variance, since the same, potentially idiosyncratic, participants will be involved. To give the most credence to the reliability of the scale, the ideal procedure is to develop the scale on sample A, whether cross-sectional or longitudinal, and then test it on an independent sample B.
The work of Chesney et al. on the Coping Self-Efficacy scale provides an example of this best practice in the use of independent samples (132). This study sought to investigate the psychometric characteristics of the Coping Self-Efficacy (CSE) scale, and the samples came from two independent randomized clinical trials. As such, two independent samples, each with four time points (0, 3, 6, and 12 months), were used. The authors administered the 26-item scale to the sample from the first clinical trial and examined the covariance that existed between all the scale items (exploratory factor analysis), giving the hypothesized factor structure across time in that one trial. The obtained factor structure was then fitted to baseline data from the second randomized clinical trial to test the hypothesized factor structure generated in the first sample (132).

Step 5: Item Reduction Analysis
In scale development, item reduction analysis is conducted to ensure that only parsimonious, functional, and internally consistent items are ultimately included (133). Therefore, the goal of this phase is to identify the items that are not, or are the least, related to the domain under study, for deletion or modification.
Two theories, classical test theory (CTT) and item response theory (IRT), underpin scale development (134). CTT is considered the traditional test theory and IRT the modern test theory; both function to produce latent constructs. Each theory may be used singly or in conjunction with the other to complement its strengths (15, 135). Whether the researcher is using CTT or IRT, the primary goal is to obtain functional items, i.e., items that are correlated with each other, discriminate between individual cases, underscore a single or multidimensional domain, and contribute significantly to the construct.
CTT allows the prediction of outcomes of constructs and the difficulty of items (136). CTT models assume that items forming constructs in their observed, manifest forms consist of a true score on the domain of interest and a random error, which is the difference between the true score and a set of observed scores by an individual (137). IRT seeks to model the way in which constructs manifest themselves in terms of observable item responses (138). Comparatively, the IRT approach to scale development has the advantage of allowing the researcher to determine the effect of adding or deleting a given item or set of items by examining the item information and standard error functions for the item pool (138).
Several techniques exist within the two theories to reduce the item pool, depending on which test theory is driving the scale. The five major techniques used are: item difficulty and item discrimination indices, which are primarily for binary responses; inter-item and item-total correlations, which are mostly used for categorical items; and distractor efficiency analysis for items with multiple-choice response options (1, 2).

Item Difficulty Index
The item difficulty index is both a CTT and an IRT parameter that can be traced largely to educational and psychological testing to assess the relative difficulties and discrimination abilities of test items (66). Subsequently, this approach has been applied to more attitudinal-type scales designed to measure latent constructs.
Under the CTT framework, the item difficulty index, also called item easiness, is the proportion of correct answers on a given item, e.g., the proportion of correct answers on a math test (1, 2). It ranges between 0.0 and 1.0. A high difficulty score means a greater proportion of the sample answered the question correctly. A lower difficulty score means a smaller proportion of the sample understood the question and answered correctly; this may be due to the item being coded wrongly, ambiguity with the item, confusing language, or ambiguity in the response options. A lower difficulty score suggests a need to modify the items or delete them from the pool of items.
Under the IRT framework, the item difficulty parameter is the probability of a particular examinee correctly answering any given item (67). This has the advantage of allowing the researcher to identify the different levels of individual performance on specific questions, as well as to develop particular questions for specific subgroups or populations (67). Item difficulty is estimated directly using logistic models instead of proportions. Researchers must determine whether they need items with low, medium, or high difficulty. For instance, researchers interested in general-purpose scales will focus on items with medium difficulty (68), i.e., a proportion of item assertions ranging from 0.4 to 0.6 (2, 68). The item difficulty index can be calculated using existing commands in Mplus, R, SAS, SPSS, or Stata.
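A minimal sketch of the CTT difficulty (easiness) index follows; it is not from the original article. The binary `responses` matrix is hypothetical, and the 0.4-0.6 band is the medium-difficulty range cited above.

```python
# Minimal sketch of the CTT item difficulty (easiness) index: the proportion of
# respondents endorsing or correctly answering each item.
# `responses` is hypothetical (rows = respondents, columns = items, coded 0/1).
import numpy as np

responses = np.array([
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 1, 1],
])

difficulty = responses.mean(axis=0)   # proportion of 1s per item
for j, p in enumerate(difficulty, start=1):
    flag = "keep" if 0.4 <= p <= 0.6 else "review (outside the 0.4-0.6 medium-difficulty band)"
    print(f"Item {j}: difficulty = {p:.2f} -> {flag}")
```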
Item Discrimination Index
The item discrimination index (also called the item-effectiveness test) is the degree to which an item correctly differentiates between respondents or examinees on a construct of interest (69), and it can be assessed under both CTT and IRT frameworks. It is a measure of the difference in performance between groups on a construct; the upper group represents participants with high scores and the lower group those with poor or low scores. The item discrimination index is "calculated by subtracting the proportion of examinees in the lower group (lower %) from the proportion of examinees in the upper group (upper %) who got the item correct or endorsed the item in the expected manner" (69). It differentiates between the number of students in an upper group who get an item correct and the number of students in a lower group who get the item correct (70). The use of an item discrimination index enables the identification of positively discriminating items (i.e., items that differentiate rightly between those who are knowledgeable about a subject and those who are not), negatively discriminating items (i.e., items which are poorly designed such that the more knowledgeable get them wrong and the less knowledgeable get them right), and non-discriminating items (i.e., items that fail to differentiate between participants who are knowledgeable about a subject and those who are not) (70).
The item discrimination index has been found to improve test items in at least three ways. First, non-discriminating items, which fail to discriminate between respondents because they may be too easy, too hard, or ambiguous, should be removed (71). Second, items which negatively discriminate, e.g., items which fail to differentiate rightly between medically diagnosed depressed and non-depressed respondents on a happiness scale, should be reexamined and modified (70, 71). Third, items that positively discriminate should be retained, e.g., items that are correctly affirmed by a greater proportion of respondents who are medically free of depression, with very low affirmation by respondents diagnosed as medically depressed (71). In some cases, it has been recommended that such positively discriminating items be considered for revision (70), as the differences could be due to the level of difficulty of the item.
An item discrimination index can be calculated through correlational analysis between the performance on an item and an overall criterion (69), using either the point-biserial correlation coefficient or the phi coefficient (72). Item discrimination under the IRT framework is a slope parameter that determines how steeply the probability of a correct response changes as the proficiency or trait increases (73). This allows differentiation between individuals with similar abilities and can also be estimated using a logistic model. Under certain conditions, the biserial correlation coefficient under the CTT framework has proven to be identical to the IRT item discrimination parameter (67, 74, 75); thus, as the trait increases, so does the probability of endorsing an item. These parameters can be computed using existing commands in Mplus, R, SAS, SPSS, or Stata. In both CTT and IRT, higher values are indicators of greater discrimination (73).
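The two CTT discrimination statistics mentioned above can be sketched as follows. The data are hypothetical, and a rest-score version of the point-biserial correlation is used so that the item is not correlated with itself.

```python
# Minimal sketch of two CTT discrimination statistics: the upper-lower group
# discrimination index and the point-biserial correlation with the rest score.
# `responses` is hypothetical (rows = respondents, columns = 0/1 items).
import numpy as np

responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
])
total = responses.sum(axis=1)

def discrimination_index(item, total):
    """Proportion endorsing the item in the top half of scorers minus the
    proportion endorsing it in the bottom half (upper % - lower %)."""
    order = np.argsort(total)
    half = len(total) // 2
    lower, upper = order[:half], order[-half:]
    return item[upper].mean() - item[lower].mean()

def point_biserial(item, total):
    """Correlation between a dichotomous item and the rest score
    (total score with the item itself excluded)."""
    rest = total - item
    return np.corrcoef(item, rest)[0, 1]

for j in range(responses.shape[1]):
    item = responses[:, j]
    print(f"Item {j + 1}: D = {discrimination_index(item, total):+.2f}, "
          f"r_pb = {point_biserial(item, total):+.2f}")
```

Values near zero or negative on either statistic would mark an item for removal or revision, in line with the guidance above.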
This type of analysis is rarely discriminating items be considered for revision (70) as the used in the health sciences, as most multiple-choice items are differences could be due to the level of difficulty of the item. An item discrimination index can be calculated through on a Likert-type response scale and do not test respondent correct knowledge, but their experience or perception. However, correlational analysis between the performance on an item and an overall criterion (69) using either the point biserial correlation distractor analysis can help to determine whether items are well-constructed, meaningful, and functional when researchers coefficient or the phi coefficient (72). add response options to questions that do not fit a particular Item discrimination under the IRT framework is a slope experience. It is expected that participants who are determined parameter that determines how steeply the probability of a as having poor knowledge or experience on the construct will correct response changes as the proficiency or trait increases choose the distractors, while those with the right knowledge (73). This allows differentiation between individuals with similar and experience will choose the correct response options (77, 79). abilities and can also be estimated using a logistic model. Under Where those with the right knowledge and experience are not certain conditions, the biserial correlation coefficient under the able to differentiate between distractors and the right response, CTT framework has proven to be identical to the IRT item the question may have to be modified. Non-functional distractors discrimination parameter (67, 74, 75); thus, as the trait increases identified need to be removed and replaced with efficient so does the probability of endorsing an item. These parameters can be computed using existing commands in Mplus, R, SAS, distractors (80). SPSS, or Stata. In both CTT and IRT, higher values are indicators Missing Cases of greater discrimination (73). In addition to these techniques, some researchers opt to delete items with large numbers of cases that are missing, when other missing data-handling techniques cannot be used (81). For cases Inter-item and Item-Total Correlations where modern missing data handling can be used, however, A third technique to support the deletion or modification of items several techniques exist to solve the problem of missing cases. is the estimation of inter-item and item-total correlations, which Two of the approaches have proven to be very useful for scale falls under CTT. These correlations often displayed in the form development: full information maximum likelihood (FIML) (82) of a matrix are used to examine relationships that exist between and multiple imputation (83). Both methods can be applied using individual items in a pool. existing commands in statistical packages such as Mplus, R, SAS, Inter-item correlations (also known as polychoric correlations and Stata. When using multiple imputation to recover missing for categorical variables and tetrachoric correlations for binary data in the context of survey research, the researcher can impute items) examines the extent to which scores on one item are individual items prior to computing scale scores or impute the related to scores on all other items in a scale (2, 68, 76). Also, scale scores from other scale scores (84). 
Distractor Efficiency Analysis
The distractor efficiency analysis shows the distribution of incorrect options and how they contribute to the quality of a multiple-choice item (77). The incorrect options, also known as distractors, are intentionally added to the response options to attract students who do not know the correct answer to a test question (78). To calculate this, respondents are grouped into three groups, i.e., high, middle, and lower tertiles, based on their total scores on a set of items. Items are regarded as appropriate if 100% of those in the high group choose the correct response option, about 50% of those in the middle group choose the correct option, and few or none in the lower group choose the correct option (78). This type of analysis is rarely used in the health sciences, as most multiple-choice items are on a Likert-type response scale and do not test respondents' correct knowledge but rather their experience or perception. However, distractor analysis can help to determine whether items are well-constructed, meaningful, and functional when researchers add response options to questions that do not fit a particular experience. It is expected that participants who are determined to have poor knowledge or experience of the construct will choose the distractors, while those with the right knowledge and experience will choose the correct response options (77, 79). Where those with the right knowledge and experience are not able to differentiate between distractors and the right response, the question may have to be modified. Non-functional distractors that are identified need to be removed and replaced with efficient distractors (80).

Missing Cases
In addition to these techniques, some researchers opt to delete items with large numbers of cases that are missing, when other missing data-handling techniques cannot be used (81). For cases where modern missing data handling can be used, however, several techniques exist to solve the problem of missing cases. Two of the approaches have proven to be very useful for scale development: full information maximum likelihood (FIML) (82) and multiple imputation (83). Both methods can be applied using existing commands in statistical packages such as Mplus, R, SAS, and Stata. When using multiple imputation to recover missing data in the context of survey research, the researcher can impute individual items prior to computing scale scores or impute the scale scores from other scale scores (84). However, item-level imputation has been shown to produce more efficient estimates than scale-level imputation. Thus, imputing individual items before scale development is the preferred approach for handling missing cases in newly developed scales (84).
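As an illustration of item-level imputation before scoring, the sketch below uses scikit-learn's IterativeImputer; drawing several completed datasets with sample_posterior=True loosely mimics multiple imputation, whereas FIML would instead be handled inside an SEM package. The item data are hypothetical.

```python
# Minimal sketch of item-level imputation before computing scale scores.
# IterativeImputer (chained equations) fills in missing item responses; running
# it several times with sample_posterior=True approximates multiple imputation.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates the estimator)
from sklearn.impute import IterativeImputer

items = pd.DataFrame({
    "q1": [4, 5, np.nan, 4, 2, 5, 3, 4],
    "q2": [4, 4, 3, np.nan, 2, 5, 3, 4],
    "q3": [5, 5, 4, 4, np.nan, 5, 2, 4],
})

scale_scores = []
for seed in range(5):                               # five imputed datasets
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = pd.DataFrame(imputer.fit_transform(items), columns=items.columns)
    scale_scores.append(completed.mean(axis=1))     # score each imputed dataset

pooled = pd.concat(scale_scores, axis=1).mean(axis=1)   # pool across imputations
print(pooled.round(2))
```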
Step 6: Extraction of Factors
Factor extraction is the phase in which the optimal number of factors, sometimes called domains, that fit a set of items is determined. This is done using factor analysis. Factor analysis is a regression model in which observed standardized variables are regressed on unobserved (i.e., latent) factors. Because the variables and factors are standardized, the bivariate regression coefficients are also correlations, representing the loading of each observed variable on each factor. Thus, factor analysis is used to understand the latent (internal) structure of a set of items and the extent to which the relationships between the items are internally consistent (4). This is done by extracting latent factors that represent the shared variance in responses among the multiple items (4). The emphasis is on the number of factors, the salience of factor loading estimates, and the relative magnitude of residual variances (2).
A number of analytical processes have been used to determine the number of factors to retain from a list of items, and it is beyond the scope of this paper to describe all of them. For scale development, commonly available methods to determine the number of factors to retain include a scree plot (85), the variance explained by the factor model, and the pattern of factor loadings (2). Where feasible, researchers could also assess the optimal number of factors to be drawn from the list of items using parallel analysis (86), the minimum average partial procedure (87), or the Hull method (88, 89).
The extraction of factors can also be used to reduce items. With factor analysis, items with factor loadings or slope coefficients below 0.30 are considered inadequate, as they contribute <10% of the variation of the latent construct measured. Hence, it is often recommended to retain items that have factor loadings of 0.40 and above (2, 60). Also, items with cross-loadings, or that appear not to load uniquely on individual factors, can be deleted. For single-factor models in which Rasch IRT modeling is used, items are selected as having a good fit based on mean-square residual summary statistics (infit and outfit) >0.4 and <1.6 (90).
A number of scales developed stop at this phase and jump to tests of reliability, but the factors extracted at this point only provide a hypothetical structure of the scale. The dimensionality of these factors needs to be tested (cf. Step 7) before moving on to reliability (cf. Step 8) and validity (cf. Step 9) assessment.
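Parallel analysis, one of the retention methods listed above, can be sketched with simulated data: retain factors whose observed eigenvalues exceed the mean eigenvalues obtained from random data of the same dimensions. The data-generating step below is hypothetical and only serves to illustrate the comparison.

```python
# Minimal sketch of Horn's parallel analysis: compare the eigenvalues of the
# observed item correlation matrix against eigenvalues from random data of the
# same size, and retain factors whose observed eigenvalues are larger.
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_items = 300, 8
latent = rng.normal(size=(n_obs, 1))
data = 0.7 * latent + rng.normal(size=(n_obs, n_items))   # one dominant factor

def eigenvalues(x):
    return np.sort(np.linalg.eigvalsh(np.corrcoef(x, rowvar=False)))[::-1]

observed = eigenvalues(data)
random_eigs = np.mean(
    [eigenvalues(rng.normal(size=(n_obs, n_items))) for _ in range(100)], axis=0)

n_factors = int(np.sum(observed > random_eigs))
print("Observed eigenvalues:", observed.round(2))
print("Mean random eigenvalues:", random_eigs.round(2))
print("Factors to retain (parallel analysis):", n_factors)
```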
The extraction of factors can also be used to reduce items. In factor analysis, items with factor loadings (slope coefficients) below 0.30 are considered inadequate because they contribute less than 10% of the variation in the latent construct being measured; hence, it is often recommended to retain items with factor loadings of 0.40 and above (2, 60). Items with cross-loadings, or items that do not load uniquely on individual factors, can also be deleted. For single-factor models in which Rasch IRT modeling is used, items are judged to fit well when their mean-square residual summary statistics (infit and outfit) fall between 0.4 and 1.6 (90).

A number of scale developers stop at this phase and jump to tests of reliability, but the factors extracted at this point provide only a hypothetical structure of the scale. The dimensionality of these factors needs to be tested (cf. Step 7) before moving on to reliability (cf. Step 8) and validity (cf. Step 9) assessment.

PHASE 3: SCALE EVALUATION

Step 7: Tests of Dimensionality
The test of dimensionality examines whether the hypothesized factors or factor structure extracted from a previous model holds when tested at a different time point in a longitudinal study or, ideally, in a new sample (91). Tests of dimensionality determine whether the measurement of the items, their factors, and their functioning are the same across two independent samples or within the same sample at different time points. Such tests can be conducted using independent cluster model (ICM)-confirmatory factor analysis, bifactor modeling, or measurement invariance.

Confirmatory Factor Analysis
Confirmatory factor analysis is a form of psychometric assessment that allows for the systematic comparison of an alternative a priori factor structure based on systematic fit assessment procedures, and it estimates the relationships between latent constructs after correcting for measurement error (92). Morin et al. (92) note that it relies on a highly restrictive ICM, in which cross-loadings between items and non-target factors are assumed to be exactly zero. The systematic fit assessment procedures are judged against meaningful satisfactory thresholds; Table 2 contains the most common indices used for testing dimensionality. These include the chi-square test of exact fit, the Root Mean Square Error of Approximation (RMSEA ≤ 0.06), the Tucker Lewis Index (TLI ≥ 0.95), the Comparative Fit Index (CFI ≥ 0.95), the Standardized Root Mean Square Residual (SRMR ≤ 0.08), and the Weighted Root Mean Square Residual (WRMR ≤ 1.0) (90, 92–101).
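Confirmatory factor analysis is usually carried out in dedicated software such as Mplus, lavaan in R, or Stata's sem command. For readers working in Python, the sketch below shows how a two-factor ICM structure might be specified with the third-party semopy package; the factor names, items, and simulated data are hypothetical, and the exact function names should be checked against the package's documentation.

```python
# Illustrative sketch only: a two-factor ICM-CFA specified with the third-party
# `semopy` package (lavaan-style syntax). Factor and item names are hypothetical;
# verify the API against the semopy documentation before relying on it.
import numpy as np
import pandas as pd
from semopy import Model, calc_stats

rng = np.random.default_rng(2)
n = 300
f1 = rng.normal(size=n)
f2 = 0.5 * f1 + rng.normal(size=n)                      # two correlated latent factors
df = pd.DataFrame({f"item{i}": (f1 if i <= 4 else f2) + rng.normal(scale=1.0, size=n)
                   for i in range(1, 9)})

model_desc = """
Support  =~ item1 + item2 + item3 + item4
Barriers =~ item5 + item6 + item7 + item8
"""

cfa = Model(model_desc)
cfa.fit(df)                                             # default maximum likelihood estimation
print(calc_stats(cfa).T)                                # chi-square, RMSEA, CFI, TLI, ...
print(cfa.inspect())                                    # loadings, factor covariance, residuals
```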
TABLE 2 | Description of model fit indices and thresholds for evaluating scales developed for health, social, and behavioral research.

Chi-square test of exact fit. The chi-square value is a test statistic of the goodness of fit of a factor model; it compares the observed covariance matrix with the theoretically proposed covariance matrix. The chi-square test of model fit is overly sensitive to sample size and varies when dealing with non-normal variables: non-normal data, a small sample size (n = 180–300), and highly correlated items make the chi-square approximation inaccurate. An alternative is the Satorra-Bentler scaled (mean-adjusted) difference chi-square statistic; the DIFFTEST has been recommended for models with binary and ordinal variables. References: (2, 93).

Root Mean Square Error of Approximation (RMSEA). A measure of the estimated discrepancy between the population and model-implied population covariance matrices per degree of freedom (139). Browne and Cudeck recommend RMSEA ≤ 0.05 as indicative of close fit, 0.05–0.08 as fair fit, and values >0.10 as poor fit between the hypothesized model and the observed data; Hu and Bentler suggest RMSEA ≤ 0.06 may indicate good fit. References: (26, 96–100).

Tucker Lewis Index (TLI). Based on comparing the proposed factor model to a model in which no interrelationships at all are assumed among any of the items. Bentler and Bonett suggest that models with overall fit indices of <0.90 are generally inadequate and can be improved substantially; Hu and Bentler recommend TLI ≥ 0.95. References: (95–98).

Comparative Fit Index (CFI). An incremental relative fit index that measures the relative improvement in the fit of the researcher's model over that of a baseline model. CFI ≥ 0.95 is often considered an acceptable fit. References: (95–98).

Standardized Root Mean Square Residual (SRMR). A measure of the mean absolute correlation residual, the overall difference between the observed and predicted correlations. The threshold for acceptable model fit is SRMR ≤ 0.08. References: (95–98).

Weighted Root Mean Square Residual (WRMR). Uses a "variance-weighted approach especially suited for models whose variables are measured on different scales or have widely unequal variances" (139); it has been assessed to be most suitable for models fitted to binary and ordinal data. Yu recommends WRMR < 1.0; this index is used for confirmatory factor analysis and structural equation models with binary and ordinal variables. References: (101).

Standard of reliability for scales. Nunnally recommends a threshold of ≥0.90 for assessing the internal consistency of scales: a reliability of 0.90 is the minimum that should be tolerated and 0.95 the desirable standard, although a reliability coefficient of 0.70 has often been accepted as satisfactory for most scales. References: (117, 123).
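Most SEM programs report these indices directly, but seeing how the comparative indices are assembled from the chi-square statistics of the fitted model and of the baseline (independence) model can make the thresholds in Table 2 less opaque. The helper below uses the standard formulas for RMSEA, CFI, and TLI; the numerical inputs are invented purely for illustration.

```python
# Standard formulas for RMSEA, CFI, and TLI computed from the chi-square and degrees
# of freedom of the fitted model and of the baseline (independence) model.
# The example numbers are hypothetical.
import math

def rmsea(chi2: float, df: int, n: float) -> float:
    """Root Mean Square Error of Approximation (one common formulation)."""
    return math.sqrt(max(0.0, (chi2 - df) / (df * (n - 1))))

def cfi(chi2_m: float, df_m: int, chi2_b: float, df_b: int) -> float:
    """Comparative Fit Index: improvement of the model over the baseline model."""
    d_m = max(chi2_m - df_m, 0.0)
    d_b = max(chi2_b - df_b, d_m, 0.0)
    return 1.0 - d_m / d_b if d_b > 0 else 1.0

def tli(chi2_m: float, df_m: int, chi2_b: float, df_b: int) -> float:
    """Tucker-Lewis Index (non-normed fit index)."""
    return ((chi2_b / df_b) - (chi2_m / df_m)) / ((chi2_b / df_b) - 1.0)

# Hypothetical CFA output for n = 250 respondents.
n, chi2_m, df_m, chi2_b, df_b = 250, 54.2, 34, 812.7, 45
print(f"RMSEA = {rmsea(chi2_m, df_m, n):.3f}")             # compare against <= 0.06
print(f"CFI   = {cfi(chi2_m, df_m, chi2_b, df_b):.3f}")    # compare against >= 0.95
print(f"TLI   = {tli(chi2_m, df_m, chi2_b, df_b):.3f}")    # compare against >= 0.95
```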
Bifactor Modeling
Bifactor modeling, also referred to as nested factor modeling, is a form of item response theory used in testing the dimensionality of a scale (102, 103). This method can be used when the hypothesized factor structure from the previous model produces partially overlapping dimensions, such that most of the items load onto one factor while a few items load onto a second and/or a third factor. The bifactor model allows researchers to estimate a unidimensional construct while recognizing the multidimensionality of the construct (104, 105). The bifactor model assumes each item loads onto two dimensions, i.e., the items forming the construct may be associated with more than one source of true score variance (92). The first is a general latent factor that underlies all the scale items; the second is a group factor (subscale). A "bifactor model is based on the assumption that a f-factor solution exists for a set of n items with one [general]/Global (G) factor and f – 1 Specific (S) factors also called group factors" (92). This approach allows researchers to examine any distortion that may occur when unidimensional IRT models are fit to multidimensional data (104, 105). To determine whether to retain a construct as unidimensional or multidimensional, the factor loadings on the general factor are compared with those on the group factors (103, 106). Where the loadings on the general factor are substantially larger than those on the group factors, a unidimensional scale is implied (103, 104). This method, too, is assessed against meaningful satisfactory thresholds. Alternatively, one can test for the coexistence of a general factor that underlies the construct and multiple group factors that explain the remaining variance not explained by the general factor (92). Each of these methods can be implemented in statistical software such as Mplus, R, SAS, SPSS, or Stata.

Measurement Invariance
Another method to test dimensionality is measurement invariance, also referred to as factorial invariance or measurement equivalence (107). Measurement invariance concerns the extent to which the psychometric properties of the observed indicators are transportable (generalizable) across groups or over time (108). These properties include the hypothesized factor structure, regression slopes, intercepts, and residual variances. Measurement invariance is tested sequentially at five levels—configural, metric, scalar, strict (residual), and structural (107, 109). Of key significance to the test of dimensionality is configural invariance, which concerns whether the hypothesized factor structure is the same across groups. This assumption has to be met in order for subsequent tests to be meaningful (107, 109). For example, a hypothesized unidimensional structure, when tested across multiple countries, should be the same in each country. In CTT, this can be tested using multigroup confirmatory factor analysis (110–112).

An alternative approach to measurement invariance for testing unidimensionality under item response theory is the Rasch measurement model for binary items and polytomous IRT models for categorical items. Here, the emphasis is on testing differential item functioning (DIF)—an indicator of whether "a group of respondents is scoring better than another group of respondents on an item or a test after adjusting for the overall ability scores of the respondents" (108, 113). This is analogous to the conditions underpinning measurement invariance in a multi-group CFA (108, 113).

Whether the hypothesized structure is bidimensional or multidimensional, each dimension in the structure needs to be tested again to confirm its unidimensionality. This can also be done using confirmatory factor analysis. Appropriate model fit indices and the strength of the factor loadings (cf. Table 2) are the basis on which the latent structure of the items can be judged.

One commonly encountered pitfall is a lack of satisfactory global model fit in a confirmatory factor analysis conducted on a new sample following a satisfactory initial factor analysis performed on a previous sample. Lack of satisfactory fit offers the opportunity to identify additional underperforming items for removal; items with very poor loadings (≤0.3) can be considered for removal. Modification indices, produced by Mplus and other structural equation modeling (SEM) programs, can also help identify items that need to be modified. Sometimes a higher-order factor structure, in which the correlations among the original factors are explained by one or more higher-order factors, is needed. This can likewise be assessed using statistical software such as Mplus, R, SAS, SPSS, or Stata.

A good example of best practice is seen in the work of Pushpanathan et al. (114) on the appropriateness of using a traditional confirmatory factor analysis or a bifactor model to assess whether the Parkinson's Disease Sleep Scale-Revised was better used as a unidimensional scale, a tri-dimensional scale, or a scale with an underlying general factor and three group factors (sub-scales). They tested this using three different models: a unidimensional model (1-factor CFA); a 3-factor model (3-factor CFA) consisting of sub-scales measuring insomnia, motor symptoms and obstructive sleep apnea, and REM sleep behavior disorder; and a confirmatory bifactor model with a general factor and the same three sub-scales combined. The results suggested that only the bifactor model with a general factor and the three sub-scales combined achieved satisfactory model fit. Based on these results, the authors cautioned against the use of unidimensional total scale scores as a cardinal indicator of sleep in Parkinson's disease and encouraged the examination of its multidimensional subscales (114).

Scoring Scale Items
Finalized items from the tests of dimensionality can be used to create scale scores for substantive analysis, including tests of reliability and validity. Scale scores can be calculated using unweighted or weighted procedures. The unweighted approach involves summing standardized item scores or raw item scores, or computing the mean of raw item scores (115). The weighted approach can be implemented in statistical software programs such as Mplus, R, SAS, SPSS, or Stata. For instance, in confirmatory factor analysis, structural equation models, or exploratory factor analysis, each factor produced reveals a statistically independent source of variation among a set of items (115). The contribution of each individual item to this factor is considered a weight, with the factor loading value representing the weight. The score associated with each factor in a model then represents a composite scale score based on a weighted sum of the individual items using the factor loadings (115). In general, it makes little difference to the performance of the scale whether scores are computed from unweighted items (e.g., mean or sum scores) or weighted items (e.g., factor scores).
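As a concrete illustration of the unweighted and weighted options, the sketch below computes sum, mean, and factor-loading-weighted scores for a four-item scale; the loadings are hypothetical stand-ins for values that would come from the factor analysis retained in Steps 6 and 7.

```python
# Unweighted vs. weighted scale scores (Step 7, "Scoring Scale Items").
# The item responses and factor loadings are illustrative; in practice the loadings
# come from the CFA/EFA solution retained earlier.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
items = pd.DataFrame(rng.integers(1, 6, size=(100, 4)),
                     columns=["item1", "item2", "item3", "item4"])

# Unweighted: simple sum (or mean) of raw item scores.
items["sum_score"] = items[["item1", "item2", "item3", "item4"]].sum(axis=1)
items["mean_score"] = items[["item1", "item2", "item3", "item4"]].mean(axis=1)

# Weighted: items contribute in proportion to their (hypothetical) factor loadings.
loadings = pd.Series({"item1": 0.78, "item2": 0.71, "item3": 0.65, "item4": 0.52})
z_items = (items[loadings.index] - items[loadings.index].mean()) / items[loadings.index].std()
items["weighted_score"] = z_items.mul(loadings, axis=1).sum(axis=1)

print(items[["sum_score", "mean_score", "weighted_score"]].head())
```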
Step 8: Tests of Reliability
Reliability is the degree of consistency exhibited when a measurement is repeated under identical conditions (116). A number of standard statistics have been developed to assess the reliability of a scale, including Cronbach's alpha (117), ordinal alpha (118, 119) specific to binary and ordinal scale items, test–retest reliability (coefficient of stability) (1, 2), McDonald's omega (120), Raykov's rho (2), Revelle's beta (121, 122), split-half estimates, the Spearman-Brown formula, the alternate form method (coefficient of equivalence), and inter-observer reliability (1, 2). Of these statistics, Cronbach's alpha and test–retest reliability are predominantly used to assess the reliability of scales (2, 117).

Cronbach's Alpha
Cronbach's alpha assesses the internal consistency of the scale items, i.e., the degree to which the set of items in the scale co-vary relative to their sum score (1, 2, 117). An alpha coefficient of 0.70 has often been regarded as an acceptable threshold for reliability; however, values between 0.80 and 0.95 are preferred as evidence of the psychometric quality of a scale (60, 117, 123). Cronbach's alpha has been the most commonly used statistic and seems to have received general approval; however, reliability statistics such as Raykov's rho, ordinal alpha, and Revelle's beta, which are argued to improve on Cronbach's alpha, are beginning to gain acceptance.
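Cronbach's alpha can be computed directly from the item variances and the variance of the sum score, as in the sketch below; the simulated responses are a stand-in for real survey data, and most statistical packages provide equivalent routines.

```python
# Cronbach's alpha from first principles:
#   alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
# The simulated responses are illustrative only.
import numpy as np

rng = np.random.default_rng(4)
true_score = rng.normal(size=300)
items = np.column_stack([true_score + rng.normal(scale=1.0, size=300) for _ in range(6)])

def cronbach_alpha(x: np.ndarray) -> float:
    """x is an (n_respondents x k_items) array of item responses."""
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1)          # variance of each item
    total_var = x.sum(axis=1).var(ddof=1)      # variance of the sum score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

print(f"Cronbach's alpha = {cronbach_alpha(items):.2f}")  # judge against the 0.70/0.80 thresholds
```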
Test–Retest Reliability
An additional approach to testing reliability is test–retest reliability. Test–retest reliability, also known as the coefficient of stability, is used to assess the degree to which participants' performance is repeatable, i.e., how consistent their sum scores are across time (2). Researchers vary in how they assess test–retest reliability: some prefer the intraclass correlation coefficient (124), while others use the Pearson product-moment correlation (125). In both cases, the higher the correlation, the higher the test–retest reliability, with values close to zero indicating low reliability. In addition, study conditions could change the value of the construct being measured over time (as in an intervention study, for example), which would lower the test–retest reliability.

The work of Johnson et al. (16) on the validation of the HIV Treatment Adherence Self-Efficacy Scale (ASES) is a good example of testing reliability. As part of testing for reliability, the authors assessed the internal consistency of the ASES and its subscales using Raykov's rho (which produces a coefficient similar to alpha but with fewer assumptions and with confidence intervals); they then tested the temporal consistency of the ASES factor structure, followed by a test–retest reliability assessment of the latent factors. Together, these approaches provided support for the reliability of the ASES.

Other approaches that have been found useful for supporting scale reliability include split-half estimates, the Spearman-Brown formula, the alternate form method (coefficient of equivalence), and inter-observer reliability (1, 2).
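A minimal check of test–retest reliability correlates the sum scores from two administrations of the scale, as sketched below on simulated data; an intraclass correlation coefficient can be substituted where absolute agreement between administrations matters.

```python
# Test-retest reliability as the correlation between sum scores at two time points.
# Simulated data stand in for two administrations of the same scale.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(5)
stable_trait = rng.normal(size=150)
score_t1 = stable_trait + rng.normal(scale=0.5, size=150)   # administration 1
score_t2 = stable_trait + rng.normal(scale=0.5, size=150)   # administration 2 (e.g., weeks later)

r, p = pearsonr(score_t1, score_t2)
print(f"Test-retest r = {r:.2f} (p = {p:.3g})")   # values near 0 would indicate poor stability
```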
Step 9: Tests of Validity
Scale validity is the extent to which "an instrument indeed measures the latent dimension or construct it was developed to evaluate" (2). Although it is discussed at length here in Step 9, validation is an ongoing process that starts with the identification and definition of the domain of study (Step 1) and continues to its generalizability with other constructs (Step 9) (36). The validity of an instrument can be examined in numerous ways. The most common tests of validity are content validity (described in Step 2), which can be done before the instrument is administered to the target population, and criterion validity (predictive and concurrent) and construct validity (convergent, discriminant, differentiation by known groups, correlations), which are assessed after survey administration.

Criterion Validity
Criterion validity is the "degree to which there is a relationship between a given test score and performance on another measure of particular relevance, typically referred to as criterion" (1, 2). There are two forms of criterion validity: predictive (criterion) validity and concurrent (criterion) validity. Predictive validity is "the extent to which a measure predicts the answers to some other question or a result to which it ought to be related with" (31); that is, the scale should be able to predict a future behavior. An example is the ability of an exclusive breastfeeding social support scale to predict exclusive breastfeeding (10): the mother's exclusive breastfeeding occurs after social support has been given, so the scale should predict the behavior. Predictive validity can be estimated by examining the association between the scale scores and the criterion in question.

Concurrent criterion validity is the extent to which test scores are related to a criterion (gold standard) measurement made at the time of test administration or shortly afterward (2). This can be estimated using the Pearson product-moment correlation or latent variable modeling. The work of La Greca and Stone on the psychometric evaluation of the revised version of a social anxiety scale for children (SASC-R) provides a good example of the evaluation of concurrent validity (140). In that study, the authors collected data on an earlier validated version of the SASC scale consisting of 10 items, as well as on the revised version, the SASC-R, which had an additional 16 items, making a 26-item scale. The SASC consisted of two subscales [fear of negative evaluation (FNE); social avoidance and distress (SAD)], and the SASC-R produced three new subscales (FNE, SAD-New, and SAD-General). Using Pearson product-moment correlations, the authors examined the inter-correlations between the common subscales for FNE, and between SAD and SAD-New. With validity coefficients of 0.94 and 0.88, respectively, the authors found evidence of concurrent validity.

A limitation of concurrent validity is that this strategy does not work well with small sample sizes because of their large sampling errors. Second, appropriate criterion variables or "gold standards" may not be available (2), which may account for its omission from most validation studies.

Construct Validity
Construct validity is the "extent to which an instrument assesses a construct of concern and is associated with evidence that measures other constructs in that domain and measures specific real-world criteria" (2). Four indicators of construct validity are relevant to scale development: convergent validity, discriminant validity, differentiation by known groups, and correlation analysis.

Convergent validity is the extent to which a construct measured in different ways yields similar results. Specifically, it is the "degree to which scores on a studied instrument are related to measures of other constructs that can be expected on theoretical grounds to be close to the one tapped into by this instrument" (2, 37, 126). This is best estimated through the multi-trait multi-method matrix (2), although in some cases researchers have used latent variable modeling or Pearson product-moment correlations based on Fisher's Z transformation. Evidence of convergent validity is provided by the extent to which the newly developed scale correlates highly with other variables designed to measure the same construct (2, 126); it can be invalidated by correlations that are too low or weak with other tests intended to measure the same construct.

Discriminant validity is the extent to which a measure is novel and not simply a reflection of some other construct (126). Specifically, it is the "degree to which scores on a studied instrument are differentiated from behavioral manifestations of other constructs, which on theoretical grounds can be expected not to be related to the construct underlying the instrument under investigation" (2). This is also best estimated through the multi-trait multi-method matrix (2). Discriminant validity is indicated by predictably low or weak correlations between the measure of interest and other measures that are supposedly not measuring the same variable or concept (126). A newly developed construct can be invalidated by correlations that are too high with other tests intended to differ in what they measure (37). This approach is critical in differentiating the newly developed construct from rival alternatives (36).

Differentiation or comparison between known groups examines the distribution of the newly developed scale scores over known binary groups (126). It is premised on previous theoretical and empirical knowledge of how the two groups should perform. An example of best practice is seen in the work of Boateng et al. on the validation of a household water insecurity scale in Kenya. In that study, we compared mean household water insecurity scores between households with and without E. coli present in their drinking water. Consistent with the extant literature, we found that households with E. coli present in their drinking water had higher mean water insecurity scores than households without E. coli, suggesting that the scale could discriminate between particular known groups.

Although correlational analysis is frequently used, bivariate regression analysis is preferred to correlational analysis for quantifying validity (127, 128). Regression analysis between the scale scores and an indicator of the domain examined has a number of important advantages over correlational analysis. First, regression analysis quantifies the association in meaningful units, facilitating judgments of validity. Second, regression analysis avoids confounding validity with the underlying variation in the sample, so the results from one sample are more applicable to other samples in which the underlying variation may differ. Third, the regression model can be extended to examine discriminant validity by adding potential alternative measures. In addition to regression analysis, alternative techniques such as the analysis of the standard deviations of the differences between scores and the examination of intraclass correlation coefficients (ICC) have been recommended as viable options (128).

Taken together, these methods make it possible to assess the validity of an adapted or newly developed scale. In addition to predictive validity, existing studies in health, social, and behavioral sciences suggest that scale validity is supported if at least two of the different forms of construct validity discussed in this section have been examined. Further information about establishing validity and constructing indicators from scales can be found in Frongillo et al. (141).
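To illustrate the regression-based approach recommended above, the sketch below regresses a hypothetical criterion on the scale score (predictive validity) and compares mean scores across a known binary grouping (differentiation by known groups). All variables are simulated and the effect sizes are arbitrary.

```python
# Illustrative validity checks:
#   (1) predictive validity via bivariate regression of a criterion on the scale score;
#   (2) differentiation by known groups via a two-sample t-test.
# All data and group labels are simulated for illustration.
import numpy as np
from scipy.stats import linregress, ttest_ind

rng = np.random.default_rng(6)
scale_score = rng.normal(50, 10, size=200)                      # newly developed scale
criterion = 0.4 * scale_score + rng.normal(scale=8, size=200)   # later behavioral outcome

# (1) Predictive validity: the slope is expressed in meaningful criterion units.
fit = linregress(scale_score, criterion)
print(f"slope = {fit.slope:.2f} criterion units per scale point, r^2 = {fit.rvalue**2:.2f}")

# (2) Known-groups validity: compare means across a binary grouping variable
#     (e.g., a marker the groups are expected to differ on).
group = rng.integers(0, 2, size=200).astype(bool)
observed = scale_score + np.where(group, 5.0, 0.0)              # group 1 simulated to score higher
t, p = ttest_ind(observed[group], observed[~group])
print(f"mean difference = {observed[group].mean() - observed[~group].mean():.1f}, "
      f"t = {t:.2f}, p = {p:.3g}")
```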
CONCLUSIONS
In sum, we have sought to give an overview of the key steps in scale development and validation (Figure 1) as well as to help the reader understand how one might approach each step (Table 1). We have also given a basic introduction to the conceptual and methodological underpinnings of each step.

Because scale development is so complicated, this should be considered a primer, i.e., a "jumping off point" for anyone interested in scale development. The technical literature and examples of rigorous scale development mentioned throughout will be important for readers to pursue. There are a number of matters not addressed here, including how to interpret scale output, the designation of cut-offs, when indices, rather than scales, are more appropriate, and principles for re-testing scales in new populations. Also, this review leans more toward the classical test theory approach to scale development; a comprehensive review on IRT modeling will be complementary. We hope this review helps to ease readers into the literature, but space precludes consideration of all these topics.

The necessity of the nine steps that we have outlined here (Table 1, Figure 1) will vary from study to study. While studies focusing on developing scales de novo may use all nine steps, others, e.g., those that set out to validate existing scales, may end up using only the last four steps. Resource constraints, including time, money, and participant attention and patience, are very real, and must be acknowledged as additional limits to rigorous scale development. We cannot state which steps are the most important; difficult decisions about which steps to approach less rigorously can only be made by each scale developer, based on the purpose of the research, the proposed end-users of the scale, and resources available. It is our hope, however, that by outlining the general shape of the phases and steps in scale development, researchers will be able to purposively choose the steps that they will include, rather than omitting a step out of lack of knowledge.

Well-designed scales are the foundation of much of our understanding of a range of phenomena, but ensuring that we accurately quantify what we purport to measure is not a simple matter. By making scale development more approachable and transparent, we hope to facilitate the advancement of our understanding of a range of health, social, and behavioral outcomes.

AUTHOR CONTRIBUTIONS
GB and SY developed the first draft of the scale development and validation manuscript. All authors participated in the editing and critical revision of the manuscript and approved the final version of the manuscript for publication.

FUNDING
Funding for this work was obtained by SY through the National Institute of Mental Health—R21 MH108444. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of Mental Health or the National Institutes of Health.

ACKNOWLEDGMENTS
We would like to acknowledge the importance of the works of several scholars of scale development and validation used in developing this primer, particularly Robert DeVellis, Tenko Raykov, George Marcoulides, David Streiner, and Betsy McCoach. We would also like to acknowledge the help of Josh Miller of Northwestern University for assisting with the design of Figure 1 and the development of Table 1, and we thank Zeina Jamuladdine for helpful comments on tests of unidimensionality.

REFERENCES
9. Hirani SAA, Karmaliani R, Christie T, Rafique G. Perceived Breastfeeding Support Assessment Tool (PBSAT): development and 1. DeVellis RF.
Scale Development: Theory and Application. Los Angeles, CA: testing of psychometric properties with Pakistani urban working Sage Publications (2012). mothers. Midwifery (2013) 29:599–607. doi: 10.1016/j.midw.2012. 2. Raykov T, Marcoulides GA. Introduction to Psychometric Theory. New York, 05.003 NY: Routledge, Taylor & Francis Group (2011). 10. Boateng GO, Martin S., Collins S, Natamba BK, Young SL. Measuring 3. Streiner DL, Norman GR, Cairney J. Health Measurement Scales: A Practical exclusive breastfeeding social support: scale development and validation in Guide to Their Development and Use. Oxford University Press (2015). Uganda. Matern Child Nutr. (2018). doi: 10.1111/mcn.12579. [Epub ahead of 4. McCoach DB, Gable RK, Madura, JP. Instrument Development in the Affective print]. Domain. School and Corporate Applications, 3rd Edn. New York, NY: 11. Arbach A, Natamba BK, Achan J, Griffiths JK, Stoltzfus RJ, Mehta S, Springer (2013). et al. Reliability and validity of the center for epidemiologic studies- 5. Morgado FFR, Meireles JFF, Neves CM, Amaral ACS, Ferreira MEC. depression scale in screening for depression among HIV-infected Scale development: ten main limitations and recommendations to and -uninfected pregnant women attending antenatal services in improve future research practices. Psicol Reflex E Crítica (2018) 30:3. northern Uganda: a cross-sectional study. BMC Psychiatry (2014) 14:303. doi: 10.1186/s41155-016-0057-1 doi: 10.1186/s12888-014-0303-y 6. Glanz K, Rimer BK, Viswanath K. Health Behavior: Theory, Research, and 12. Natamba BK, Kilama H, Arbach A, Achan J, Griffiths JK, Young SL. Practice. San Francisco, CA: John Wiley & Sons, Inc (2015). Reliability and validity of an individually focused food insecurity access 7. Ajzen I. From intentions to actions: a theory of planned behavior. In: Action scale for assessing inadequate access to food among pregnant Ugandan Control SSSP Springer Series in Social Psychology Berlin; Heidelberg: Springer, women of mixed HIV status. Public Health Nutr. (2015) 18:2895–905. (1985). p. 11–39. doi: 10.1017/S1368980014001669 8. Bai Y, Peng C-YJ, Fly AD. Validation of a short questionnaire to assess 13. Neilands TB, Chakravarty D, Darbes LA, Beougher SC, Hoff CC. mothers’ perception of workplace breastfeeding support. J Acad Nutr Diet Development and validation of the sexual agreement investment scale. J Sex (2008) 108:1221–5. doi: 10.1016/j.jada.2008.04.018 Res. (2010) 47:24–37. doi: 10.1080/00224490902916017 Frontiers in Public Health | www.frontiersin.org 15 June 2018 | Volume 6 | Article 149 Boateng et al. Scale Development and Validation 14. Neilands TB, Choi K-H. A validation and reduced form of the new and existing techniques. MIS Q. (2011) 35:293. doi: 10.2307/23 female condom attitudes scale. AIDS Educ Prev. (2002) 14:158–71. 044045 doi: 10.1521/aeap.14.2.158.23903 36. Messick S. Validity of psychological assessment: validation of inferences 15. Lippman SA, Neilands TB, Leslie HH, Maman S, MacPhail C, Twine from persons’ responses and performance as scientifica inquiry into R, et al. Development, validation, and performance of a scale to score meaning. Am Psychol. (1995) 50:741–9. doi: 10.1037/0003-066X. measure community mobilization. Soc Sci Med. (2016) 157:127–37. 50.9.741 doi: 10.1016/j.socscimed.2016.04.002 37. Campbell DT, Fiske DW. Convergent and discriminant validity by 16. Johnson MO, Neilands TB, Dilworth SE, Morin SF, Remien RH, Chesney the multitrait-multimethod matrix. Psychol Bull. (1959) 56:81–105. MA. 
The role of self-efficacy in HIV treatment adherence: validation of doi: 10.1037/h0046016 the HIV treatment adherence self-efficacy scale (HIV-ASES). J Behav Med. 38. Dennis C. Theoretical underpinnings of breastfeeding confidence: a self- (2007) 30:359–70. doi: 10.1007/s10865-007-9118-3 efficacy framework. J Hum Lact. (1999) 15:195–201. doi: 10.1177/08903 17. Sexton JB, Helmreich RL, Neilands TB, Rowan K, Vella K, Boyden 3449901500303 J, et al. The Safety Attitudes Questionnaire: psychometric properties, 39. Dennis C-L, Faux S. Development and psychometric testing of the benchmarking data, and emerging research. BMC Health Serv Res. (2006) Breastfeeding Self-Efficacy Scale. Res Nurs Health (1999) 22:399–409. doi: 10. 6:44. doi: 10.1186/1472-6963-6-44 1002/(SICI)1098-240X(199910)22:5<399::AID-NUR6>3.0.CO;2-4 18. Wolfe WS, Frongillo EA. Building household food-security measurement 40. Dennis C-L. The breastfeeding self-efficacy scale: psychometric assessment tools from the ground up. Food Nutr Bull. (2001) 22:5–12. of the short form. J Obstet Gynecol Neonatal Nurs. (2003) 32:734–44. doi: 10.1177/156482650102200102 doi: 10.1177/0884217503258459 19. González W, Jiménez A, Madrigal G, Muñoz LM, Frongillo EA. 41. Frongillo EA, Nanama S. Development and validation of an experience- Development and validation of measure of household food insecurity in based measure of household food insecurity within and across urban costa rica confirms proposed generic questionnaire. J Nutr. (2008) seasons in Northern Burkina Faso. J Nutr. (2006) 136:1409S−19S. 138:587–92. doi: 10.1093/jn/138.3.587 doi: 10.1093/jn/136.5.1409S 20. Boateng GO, Collins SM, Mbullo P, Wekesa P, Onono M, Neilands T, et 42. Guion R. Content validity - the source of my discontent. Appl Psychol Meas. al. A novel household water insecurity scale: procedures and psychometric (1977) 1:1–10. doi: 10.1177/014662167700100103 analysis among postpartum women in western Kenya. PloS ONE. (2018). 43. Lawshe C. A quantitative approach to content validity. Pers Psychol. (1975) doi: 10.1371/journal.pone.0198591 28:563–75. doi: 10.1111/j.1744-6570.1975.tb01393.x 21. Melgar-Quinonez H, Hackett M. Measuring household food 44. Lynn M. Determination and quantification of content validity. Nurs Res. security: the global experience. Rev Nutr. (2008) 21:27s−37s. (1986) 35:382–5. doi: 10.1097/00006199-198611000-00017 doi: 10.1590/S1415-52732008000700004 45. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 22. Melgar-Quiñonez H, Zubieta AC, Valdez E, Whitelaw B, Kaiser L. (1960) 20:37–46. doi: 10.1177/001316446002000104 Validación de un instrumento para vigilar la inseguridad alimentaria en 46. Wynd CA, Schmidt B, Schaefer MA. Two quantitative approaches la Sierra de Manantlán, Jalisco. Salud Pública México (2005) 47:413–22. for estimating content validity. West J Nurs Res. (2003) 25:508–18. doi: 10.1590/S0036-36342005000600005 doi: 10.1177/0193945903252998 23. Hackett M, Melgar-Quinonez H, Uribe MCA. Internal validity of a 47. Linstone HA, Turoff M. (eds). The Delphi Method. Reading, MA: Addison- household food security scale is consistent among diverse populations Wesley (1975). participating in a food supplement program in Colombia. BMC Public Health 48. Augustine LF, Vazir S, Rao SF, Rao MV, Laxmaiah A, Ravinder (2008) 8:175. doi: 10.1186/1471-2458-8-175 P, et al. Psychometric validation of a knowledge questionnaire 24. Hinkin TR. 
A review of scale development practices in the study on micronutrients among adolescents and its relationship to of organizations. J Manag. (1995) 21:967–88. doi: 10.1016/0149- micronutrient status of 15–19-year-old adolescent boys, Hyderabad, 2063(95)90050-0 India. Public Health Nutr. (2012) 15:1182–9. doi: 10.1017/S13689800120 25. Haynes SN, Richard DCS, Kubany ES. Content validity in psychological 00055 assessment: a functional approach to concepts and methods. Pyschol Assess. 49. Beatty PC, Willis GB. Research synthesis: the practice of cognitive (1995) 7:238–47. doi: 10.1037/1040-3590.7.3.238 interviewing. Public Opin Q. (2007) 71:287–311. doi: 10.1093/poq/nfm006 26. Kline P. A Handbook of Psychological Testing. 2nd Edn. London: Routledge; 50. Alaimo K, Olson CM, Frongillo EA. Importance of cognitive testing for Taylor & Francis Group (1993). survey items: an example from food security questionnaires. J Nutr Educ. 27. Hunt SD. Modern Marketing Theory. Cincinnati: South-Western Publishing (1999) 31:269–75. doi: 10.1016/S0022-3182(99)70463-2 (1991). 51. Willis GB. Cognitive Interviewing and Questionnaire Design: A Training 28. Loevinger J. Objective tests as instruments of psychological theory. Psychol Manual. Cognitive Methods Staff Working Paper Series. Hyattsville, MD: Rep. (1957) 3:635–94. doi: 10.2466/pr0.1957.3.3.635 National Center for Health Statistics (1994). 29. Clarke LA, Watson D. Constructing validity: basic issues in 52. Willis GB. Cognitive Interviewing: A Tool for Improving Questionnaire objective scale development. Pyschol Assess. (1995) 7:309–19. Design. Thousand Oaks, CA: Sage Publications (2005). doi: 10.1037/1040-3590.7.3.309 53. Tourangeau R. Cognitive aspects of survey measurement and 30. Schinka JA, Velicer WF, Weiner IR. Handbook of Psychology, Vol. 2, Research mismeasurement. Int J Public Opin Res. (2003) 15:3–7. doi: 10.1093/ Methods in Psychology. Hoboken, NJ: John Wiley & Sons, Inc. (2012). ijpor/15.1.3 31. Fowler FJ. Improving Survey Questions: Design and Evaluation. Thousand 54. Morris MD, Neilands TB, Andrew E, Mahar L, Page KA, Hahn Oaks, CA: Sage Publications (1995). JA. Development and validation of a novel scale for measuring 32. Krosnick JA. Questionnaire design. In: Vannette DL, Krosnick JA, editors. interpersonal factors underlying injection drug using behaviours The Palgrave Handbook of Survey Research. Cham: Palgrave Macmillan among injecting partnerships. Int J Drug Policy (2017) 48:54–62. (2018), pp. 439–55. doi: 10.1016/j.drugpo.2017.05.030 33. Krosnick JA, Presser S. Question and questionnaire design. In: Wright JD, 55. Harris PA, Taylor R, Thielke R, Payne J, Gonzalez N, Conde JG. Research Marsden PV, editors. Handbook of Survey Research. San Diego, CA: Elsevier electronic data capture (REDCap)—a metadata-driven methodology and (2009), pp. 263–314. workflow process for providing translational research informatics support. 34. Rhemtulla M, Brosseau-Liard PÉ, Savalei V. When can categorical variables J Biomed Inform. (2009) 42:377–81. doi: 10.1016/j.jbi.2008.08.010 be treated as continuous? A comparison of robust continuous and categorical 56. GoldsteinM, Benerjee R, Kilic T. Paper v Plastic Part 1: The Survey SEM estimation methods under suboptimal conditions. Psychol Methods Revolution Is in Progress. The World Bank Development Impact. (2012). (2012) 17:354–73. doi: 10.1037/a0029315 Available online at: http://blogs.worldbank.org/impactevaluations/paper- 35. MacKenzie SB, Podsakoff PM, Podsakoff NP. 
Construct measurement v-plastic-part-i-the-survey-revolution-is-in-progress (Accessed November and validation procedures in MIS and behavioral research: integrating 10, 2017). Frontiers in Public Health | www.frontiersin.org 16 June 2018 | Volume 6 | Article 149 Boateng et al. Scale Development and Validation 57. Fanning J, McAuley E. A Comparison of tablet computer and paper-based 83. Kenward MG, Carpenter J. Multiple imputation: current perspectives. Stat questionnaires in healthy aging research. JMIR Res Protoc. (2014) 3:e38. Methods Med Res. (2007) 16:199–218. doi: 10.1177/0962280206075304 doi: 10.2196/resprot.3291 84. Gottschall AC, West SG, Enders CK. A Comparison of item-level and scale- 58. Greenlaw C, Brown-Welty S. A Comparison of web-based and paper-based level multiple imputation for questionnaire batteries. Multivar Behav Res. survey methods: testing assumptions of survey mode and response cost. Eval (2012) 47:1–25. doi: 10.1080/00273171.2012.640589 Rev. (2009) 33:464–80. doi: 10.1177/0193841X09340214 85. Cattell RB. The Scree test for the number of factors. Multivar Behav Res. 59. MacCallum RC, Widaman KF, Zhang S, Hong S. Sample size in factor (1966) 1: 245–76. doi: 10.1207/s15327906mbr0102_10 analysis. Psychol Methods (1999) 4:84–99. doi: 10.1037/1082-989X.4.1.84 86. Horn JL. A rationale and test for the number of factors in factor analysis. 60. Nunnally JC. Pyschometric Theory. New York, NY: McGraw-Hill (1978). Psychometrika (1965) 30:179–85. doi: 10.1007/BF02289447 61. Guadagnoli E, Velicer WF. Relation of sample size to the stability 87. Velicer WF. Determining the number of components from the of component patterns. Am Psychol Assoc. (1988) 103:265–75. matrix of partial correlations. Psychometrika (1976) 41:321–7. doi: 10.1037/0033-2909.103.2.265 doi: 10.1007/BF02293557 62. Comrey AL. Factor-analytic methods of scale development in personality 88. Lorenzo-Seva U, Timmerman ME, Kiers HAL. The hull method for selecting and clinical psychology. Am Psychol Assoc. (1988) 56:754–61. the number of common factors. Multivar Behav Res. (2011) 46:340–64. 63. Comrey AL, Lee H. A First Cours in Factor Analysis. Hillsdale, NJ: Lawrence doi: 10.1080/00273171.2011.564527 Erlbaum Associates, Inc. (1992). 89. Jolijn Hendriks AA, Perugini M, Angleitner A, Ostendorf F, Johnson 64. Ong DC. A Primer to Bootstrapping and an Overview of doBootstrap. JA, De Fruyt F, et al. The five-factor personality inventory: cross- Stanford, CA: Department of Psychology, Stanford University (2014). cultural generalizability across 13 countries. Eur J Pers. (2003) 17:347–73. 65. Osborne JW, Costello AB. Sample size and subject to item ratio in principal doi: 10.1002/per.491 components analysis. Pract Assess Res Eval. (2004) 99:1–15. Available online 90. Bond TG, Fox C. Applying the Rasch Model: Fundamental Measurement in at: http://pareonline.net/htm/v9n11.htm the Human Sciences. Mahwah, NJ: Erlbaum (2013). 66. Ebel R., Frisbie D. Essentials of Educational Measurement. Englewood Cliffs, 91. Brown T. Confirmatory Factor Analysis for Applied Research. New York, NY: NJ: Prentice-Hall (1979). Guildford Press (2014). 67. Hambleton R., Jones R. An NCME instructional module on comparison 92. Morin AJS, Arens AK, Marsh HW. A bifactor exploratory structural equation of classical test theory and item response theory and their applications modeling framework for the identification of distinct sources of construct- to test development. Educ Meas Issues Pract. (1993) 12:38–47. relevant psychometric multidimensionality. 
Struct Equ Model Multidiscip J. doi: 10.1111/j.1745-3992.1993.tb00543.x (2016) 23:116–39. doi: 10.1080/10705511.2014.961800 68. Raykov T. Scale Construction and Development. Lecture Notes. Measurement 93. Cochran WG. The χ test of goodness of fit. Ann Math Stat. (1952) 23:315– and Quantitative Methods. East Lansing, MI: Michigan State University 45. doi: 10.1214/aoms/1177729380 (2015). 94. Brown MW. Confirmatory Factor Analysis for Applied Research. New York, 69. Whiston SC. Principles and Applications of Assessment in Counseling. NY: Guildford Press (2014). Cengage Learning (2008). 95. Tucker LR, Lewis C. A reliability coefficient for maximum likelihood factor 70. Brennan RL. A generalized upper-lower item discrimination index. analysis. Psychometrika (1973) 38:1–10. doi: 10.1007/BF02291170 Educ Psychol Meas. (1972) 32:289–303. doi: 10.1177/0013164472032 96. Bentler PM, Bonett DG. Significance tests and goodness of fit in 00206 the analysis of covariance structures. Psychol Bull. (1980) 88:588–606. 71. Popham WJ, Husek TR. Implications of criterion-referenced measurement. doi: 10.1037/0033-2909.88.3.588 J Educ Meas. (1969) 6:1–9. doi: 10.1111/j.1745-3984.1969.tb00654.x 97. Bentler PM. Comparative fit indexes in structural models. Psychol Bull. 72. Rasiah S-MS, Isaiah R. Relationship between item difficulty and (1990) 107:238–46. doi: 10.1037/0033-2909.107.2.238 discrimination indices in true/false-type multiple choice questions of 98. Hu L, Bentler PM. Cutoff criteria for fit indexes in covariance structure a para-clinical multidisciplinary paper. Ann Acad Med Singap. (2006) analysis: Conventional criteria versus new alternatives. Struct Equ Model 35:67–71. Available online at: http://repository.um.edu.my/id/eprint/65455 Multidiscip J. (1999) 6:1–55. doi: 10.1080/10705519909540118 73. Demars C. Item Respons Theory. New York, NY: Oxford University Press 99. Jöreskog KG, Sörbom D. LISREL 8.54. Structural Equation Modeling With (2010). the Simplis Command Language (2004) Available online at: http://www.unc. 74. Lord FM. Applications of Item Response Theory to Practical Testing Problems. edu/~rcm/psy236/holzcfa.lisrel.pdf New Jersey, NJ: Englewood Cliffs (1980). 100. Browne MW, Cudeck R. Alternative ways of assessing model fit. In: Bollen 75. Bazaldua DAL, Lee Y-S, Keller B, Fellers L. Assessing the KA, Long, JS, editors. Testing Structural Equation Models. Newbury Park, performance of classical test theory item discrimination estimators CA: Sage Publications (1993). p. 136–62. in Monte Carlo simulations. Asia Pac Educ Rev. (2017) 18:585–98. 101. Yu C. Evaluating Cutoff Criteria of Model Fit Indices for Latent Variable doi: 10.1007/s12564-017-9507-4 Models With Binary and Continuous Outcomes. Los Angeles, CA: University 76. Piedmont RL. Inter-item correlations. In Encyclopedia of Quality of of California, Los Angeles. (2002). Life and Well-Being Research. Dordrecht: Springer (2014). p. 3303–4. 102. Gerbing DW, Hamilton JG. Viability of exploratory factor analysis as a doi: 10.1007/978-94-007-0753-5_1493 precursor to confirmatory factor analysis. Struct Equ Model Multidiscip J. 77. Tarrant M, Ware J, Mohammed AM. An assessment of functioning and non- (1996) 3:62–72. doi: 10.1080/10705519609540030 functioning distractors in multiple-choice questions: a descriptive analysis. 103. Reise SP, Morizot J, Hays RD. The role of the bifactor model in resolving BMC Med Educ. (2009) 9:40. doi: 10.1186/1472-6920-9-40 dimensionality issues in health outcomes measures. Qual Life Res. (2007) 78. 
Fulcher G, Davidson F. The Routledge Handbook of Language Testing. New 16:19–31. doi: 10.1007/s11136-007-9183-7 York, NY: Routledge (2012). 104. Gibbons RD, Hedeker DR. Full-information item bi-factor analysis. 79. Cizek GJ, O’Day DM. Further investigation of nonfunctioning options in Psychometrika (1992) 57:423–36. doi: 10.1007/BF02295430 multiple-choice test items. Educ Psychol Meas. (1994) 54:861–72. 105. Reise SP, Moore TM, Haviland MG. Bifactor models and rotations: exploring 80. Haladyna TM, Downing SM. Validity of a taxonomy of multiple-choice the extent to which multidimensional data yield univocal scale scores. J Pers item-writing rules. Appl Meas Educ. (1989) 2:51–78. doi: 10.1207/s153248 Assess. (2010) 92:544–59. doi: 10.1080/00223891.2010.496477 18ame0201_4 106. Brunner M, Nagy G, Wilhelm O. A Tutorial on hierarchically structured 81. Tappen RM. Advanced Nursing Research. Sudbury, MA: Jones & Bartlett constructs. J Pers. (2012) 80:796–846. doi: 10.1111/j.1467-6494.2011.00749.x Publishers (2011). 107. Vandenberg RJ, Lance CE. A review and synthesis of the measurement 82. Enders CK, Bandalos DL. The relative performance of full invariance literature: suggestions, practices, and recommendations information maximum likelihood estimation for missing data in for organizational research - Robert J. Vandenberg, Charles E. Lance, structural equation models. Struct Equ Model. (2009) 8:430–57. 2000. Organ Res Methods (2000) 3:4–70. doi: 10.1177/10944281 doi: 10.1207/S15328007SEM0803_5 0031002 Frontiers in Public Health | www.frontiersin.org 17 June 2018 | Volume 6 | Article 149 Boateng et al. Scale Development and Validation 108. Sideridis GD, Tsaousis I, Al-harbi KA. Multi-population invariance with of dietary assessment methods. Eur J Epidemiol. (1991) 7:339–43. dichotomous measures: combining multi-group and MIMIC methodologies doi: 10.1007/BF00144997 in evaluating the general aptitude test in the arabic language - Georgios D. 129. McPhail SM. Alternative Validation Strategies: Developing New and Sideridis, Ioannis Tsaousis, Khaleel A. Al-harbi, 2015. J Psychoeduc Assess. Leveraging Existing Validity Evidence. San Francisco, CA: John Wiley & Sons, 33:568–84. doi: 10.1177/0734282914567871 Inc (2007). 109. Joreskog K. A general method for estimating a linear equation system. In: 130. Dray S, Dunsch F, Holmlund M. Electronic Versus Paper-Based Data Goldberger AS, Duncan OD, editors. Structural Equation Models in the Social Collection: Reviewing the Debate. The World Bank Development Impact Sciences. New York, NY: Seminar Press (1973). pp. 85–112. (2016). Available online at: https://blogs.worldbank.org/impactevaluations/ 110. Kim ES, Cao C, Wang Y, Nguyen DT. Measurement invariance testing with electronic-versus-paper-based-data-collection-reviewing-debate (Accessed many groups: a comparison of five approaches. Struct Equ Model Multidiscip November 10, 2017). J. (2017) 24:524–44. doi: 10.1080/10705511.2017.1304822 131. Ellen JM, Gurvey JE, Pasch L, Tschann J, Nanda JP, Catania J. A 111. Muthén B., Asparouhov T. BSEM Measurement Invariance Analysis. randomized comparison of A-CASI and phone interviews to assess (2017). Available online at: https://www.statmodel.com/examples/webnotes/ STD/HIV-related risk behaviors in teens. J Adolesc Health (2002) 31:26–30. webnote17.pdf doi: 10.1016/S1054-139X(01)00404-9 112. Asparouhov T, Muthén B. Multiple-group factor analysis alignment. Struct 132. Chesney MA, Neilands TB, Chambers DB, Taylor JM, Folkman S. Equ Model. 21:495–508. 
doi: 10.1080/10705511.2014.919210 A validity and reliability study of the coping self-efficacy scale. Br 113. Reise SP, Widaman KF, Pugh RH. Confirmatory factor analysis and item J Health Psychol. (2006) 11(Pt 3):421–37. doi: 10.1348/135910705 response theory: two approaches for exploring measurement invariance. X53155 Psychol Bull. (1993) 114:552–66. doi: 10.1037/0033-2909.114.3.552 133. Thurstone L. Multiple-Factor Analysis. Chicago, IL: University of Chicago 114. Pushpanathan ME, Loftus AM, Gasson N, Thomas MG, Timms CF, Press (1947). Olaithe M, et al. Beyond factor analysis: multidimensionality and the 134. Fan X. Item response theory and classical test theory: an empirical Parkinson’s disease sleep scale-revised. PLoS ONE (2018) 13:e0192394. comparison of their item/person statistics. Educ Psychol Meas. (1998) doi: 10.1371/journal.pone.0192394 58:357–81. doi: 10.1177/0013164498058003001 115. Armor DJ. Theta reliability and factor scaling. Sociol Methodol. (1973) 135. Glockner-Rist A, Hoijtink H. The best of both worlds: factor analysis 5:17–50. doi: 10.2307/270831 of dichotomous data using item response theory and structural 116. Porta M. A Dictionary of Epidemiology. New York, NY: Oxford University equation modeling. Struct Equ Model Multidiscip J. (2003) 10:544–65. Press (2008). doi: 10.1207/S15328007SEM1004_4 117. Cronbach LJ. Coefficient alpha and the internal structure of tests. 136. Keeves JP, Alagumalai S, editors. Applied Rasch Measurement: A Book of Psychometrika (1951) 16:297–334. doi: 10.1007/BF02310555 Exemplars: Papers in Honour of John P. Keeves. Dordrecht ; Norwell, MA: 118. Zumbo B, Gadermann A, Zeisser C. Ordinal versions of coefficients alpha Springer (2005). and theta for likert rating scales. J Mod Appl Stat Methods (2007) 6:21–9. 137. Cappelleri JC, Lundy JJ, Hays RD. Overview of classical test theory and item doi: 10.22237/jmasm/1177992180 response theory for quantitative assessment of items in developing 119. Gadermann AM, GuhnM, Zumbo B. Estimating ordinal reliability for Likert patient-reported outcome measures. Clin Ther. (2014) 36:648–62. type and ordinal item response data: a conceptual, empirical, and practical doi: 10.1016/j.clinthera.2014.04.006 guide. Pract Assess Res Eval. (2012) 17:1–13. Available online at: http://www. 138. Harvey RJ, Hammer AL. Item response theory. Couns Psychol. (1999) pareonline.net/getvn.asp?v=17&n=3 27:353–83. doi: 10.1177/0011000099273004 120. McDonald RP. Test Theory: A Unified Treatment. New Jersey, NJ : Lawrence 139. Cook KF, Kallen MA, Amtmann D. Having a fit: impact of number Erlbaum Associates, Inc (1999). of items and distribution of data on traditional criteria for assessing 121. Revelle W. Hierarchical cluster analysis and the internal structure of tests. IRT’s unidimensionality assumption. Qual. Life Res. (2009) 18:447–60. Multivar Behav Res. (1979) 14:57–74. doi: 10.1207/s15327906mbr1401_4 doi: 10.1007/s11136-009-9464-4 122. Revelle W, Zinbarg RE. Coefficients alpha, beta, omega, and the glb: 140. Greca AML, Stone WL. Social anxiety scale for children-revised: factor comments on Sijtsma. Psychometrika (2009) 74:145. doi: 10.1007/s11336- structure and concurrent validity. J Clin Child Psychol. (1993) 22:17–27. 008-9102-z doi: 10.1207/s15374424jccp2201_2 123. Bernstein I, Nunnally JC. Pyschometric Theory. New York, NY: McGraw-Hill 141. Frongillo EA, Nanama S, Wolfe WS. Technical Guide to Developing a (1994). Direct, Experience-Based Measurement Tool for Household Food Insecurity. 124. Weir JP. 
JP: Quantifying test-retest reliability using the intraclass Washington, DC: Food and Nutrition Technical Assistance Project correlation coefficient and the SEM. J Strength Con Res. (2005) 19:231–40. (2004). doi: 10.1519/15184.1 125. Rousson V, Gasser T, Seifert B. Assessing intrarater, interrater and test– Conflict of Interest Statement: The authors declare that the research was retest reliability of continuous measurements. Stat Med. (2002) 21:3431–46. conducted in the absence of any commercial or financial relationships that could doi: 10.1002/sim.1253 be construed as a potential conflict of interest. 126. Churchill GA. A paradigm for developing better measures of marketing constructs. J Mark Res. (1979) 16:64–73. doi: 10.2307/3150876 Copyright © 2018 Boateng, Neilands, Frongillo, Melgar-Quiñonez and Young. This 127. Bland JM, Altman DG. A note on the use of the intraclass is an open-access article distributed under the terms of the Creative Commons correlation coefficient in the evaluation of agreement between two Attribution License (CC BY). The use, distribution or reproduction in other forums methods of measurement. Comput Biol Med. (1990) 20:337–40. is permitted, provided the original author(s) and the copyright owner are credited doi: 10.1016/0010-4825(90)90013-F and that the original publication in this journal is cited, in accordance with accepted 128. Hebert JR, Miller DR. The inappropriateness of conventional use academic practice. No use, distribution or reproduction is permitted which does not of the correlation coefficient in assessing validity and reliability comply with these terms. Frontiers in Public Health | www.frontiersin.org 18 June 2018 | Volume 6 | Article 149 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Frontiers in Public Health Unpaywall

The second phase, scale development, i.e., turning individual items into a harmonious and measuring construct, consists of (3) pre-testing questions, (4) sampling and survey administration, (5) item reduction, and (6) extraction of latent factors. The last phase, scale evaluation, requires: (7) tests of dimensionality, (8) tests of reliability, and (9) tests of validity.

FIGURE 1 | An overview of the three phases and nine steps of scale development and validation.

Abbreviations: A-CASI, audio computer self-assisted interviewing; ASES, adherence self-efficacy scale; CAPI, computer assisted personal interviewing; CASIC, computer assisted survey information collection builder; CFA, confirmatory factor analysis; CFI, comparative fit index; CTT, classical test theory; DIF, differential item functioning; EFA, exploratory factor analysis; FIML, full information maximum likelihood; FNE, fear of negative evaluation; G, global factor; ICC, intraclass correlation coefficient; ICM, independent cluster model; IRT, item response theory; ODK, Open Data Kit; PAPI, paper and pen/pencil interviewing; QDS, Questionnaire Development System; RMSEA, root mean square error of approximation; SAD, social avoidance and distress; SAS, statistical analysis systems; SASC-R, social anxiety scale for children revised; SEM, structural equation model; SPSS, statistical package for the social sciences; SRMR, standardized root mean square residual; Stata, statistics and data; TLI, Tucker Lewis Index; WASH, water, sanitation, and hygiene; WRMR, weighted root mean square residual.
TABLE 1 | The three phases and nine steps of scale development and validation. Each entry lists the activity, its purpose, how to explore or estimate it, and references.

PHASE 1: ITEM DEVELOPMENT

Step 1: Identification of Domain and Item Generation: Selecting Which Items to Ask
- Domain identification. Purpose: to specify the boundaries of the domain and facilitate item generation. How: 1.1 Specify the purpose of the domain; 1.2 Confirm that there are no existing instruments; 1.3 Describe the domain and provide a preliminary conceptual definition; 1.4 Specify the dimensions of the domain if they exist a priori; 1.5 Define each dimension. References: (1–4), (25).
- Item generation. Purpose: to identify appropriate questions that fit the identified domain. How: 1.6 Deductive methods: literature review and assessment of existing scales; 1.7 Inductive methods: exploratory research methodologies including focus group discussions and interviews. References: (2–5), (24–41).

Step 2: Content Validity: Assessing if the Items Adequately Measure the Domain of Interest
- Evaluation by experts. Purpose: to evaluate each of the items constituting the domain for content relevance, representativeness, and technical quality. How: 2.1 Quantify the assessments of 5–7 expert judges using formalized scaling and statistical procedures, including the content validity ratio, the content validity index, or Cohen's coefficient kappa; 2.2 Conduct the Delphi method with expert judges. References: (1–5), (24, 42–48).
- Evaluation by target population. Purpose: to evaluate each item constituting the domain for representativeness of the actual experience of the target population. How: 2.3 Conduct cognitive interviews with end users of scale items to evaluate face validity. References: (20, 25).

PHASE 2: SCALE DEVELOPMENT

Step 3: Pre-testing Questions: Ensuring the Questions and Answers Are Meaningful
- Cognitive interviews. Purpose: to assess the extent to which questions reflect the domain of interest and answers produce valid measurements. How: 3.1 Administer draft questions to 5–15 interviewees in 2–3 rounds while allowing respondents to verbalize the mental process entailed in providing answers. References: (49–54).

Step 4: Survey Administration and Sample Size: Gathering Enough Data from the Right People
- Survey administration. Purpose: to collect data with minimum measurement errors. How: 4.1 Administer potential scale items to a sample that reflects the range of the target population, using paper or a device. References: (55–58).
- Establishing the sample size. Purpose: to ensure the availability of sufficient data for scale development. How: 4.2 Recommended sample size is 10 respondents per survey item and/or 200–300 observations. References: (29, 59–65).
- Determining the type of data to use. Purpose: to ensure the availability of data for scale development and validation. How: 4.3 Use cross-sectional data for exploratory factor analysis; 4.4 Use data from a second time point, at least 3 months later in a longitudinal dataset, or an independent sample, for the test of dimensionality (Step 7).

Step 5: Item Reduction: Ensuring Your Scale Is Parsimonious
- Item difficulty index. Purpose: to determine the proportion of correct answers given per item (CTT), or the probability of a particular examinee correctly answering a given item (IRT). How: 5.1 The proportion can be calculated for CTT and the item difficulty parameter estimated for IRT using statistical packages. References: (1, 2, 66–68).
- Item discrimination test. Purpose: to determine the degree to which an item or set of test questions measures a unitary attribute (CTT), or how steeply the probability of a correct response changes as ability increases (IRT). How: 5.2 Estimate biserial correlations or the item discrimination parameter using statistical packages. References: (69–75).
- Inter-item and item-total correlations. Purpose: to determine the correlations between scale items, as well as the correlations between each item and the sum score of scale items. How: 5.3 Estimate inter-item correlations/item communalities, item-total, and adjusted item-total correlations using statistical packages. References: (1, 2, 68, 76).
- Distractor efficiency analysis. Purpose: to determine the distribution of incorrect options and how they contribute to the quality of items. How: 5.4 Estimate distractor analysis using statistical packages. References: (77–80).
- Deleting or imputing missing cases. Purpose: to ensure the availability of complete cases for scale development. How: 5.5 Delete items with many cases that are permanently missing, or use multiple imputation or full information maximum likelihood for imputation of data. References: (81–84).

Step 6: Extraction of Factors: Exploring the Number of Latent Constructs that Fit Your Observed Data
- Factor analysis. Purpose: to determine the optimal number of factors or domains that fit a set of items. How: 6.1 Use scree plots, exploratory factor analysis, parallel analysis, the minimum average partial procedure, and/or the Hull method. References: (2–4), (85–90).

PHASE 3: SCALE EVALUATION

Step 7: Tests of Dimensionality: Testing if Latent Constructs Are as Hypothesized
- Test dimensionality. Purpose: to address queries on the latent structure of scale items and their underlying relationships, i.e., to validate whether the previously hypothesized structure fits the items. How: 7.1 Estimate an independent cluster model—confirmatory factor analysis (cf. Table 2); 7.2 Estimate bifactor models to eliminate ambiguity about the type of dimensionality—unidimensionality, bidimensionality, or multidimensionality; 7.3 Estimate measurement invariance to determine whether the hypothesized factors and dimensions are congruent across groups or multiple samples. References: (91–114).
- Score scale items. Purpose: to create scale scores for substantive analysis, including reliability and validity of the scale. How: 7.4 Calculate scale scores using an unweighted approach, which includes summing standardized item scores or raw item scores, or computing the mean of raw item scores; 7.5 Calculate scale scores using a weighted approach, which includes creating factor scores via confirmatory factor analysis or structural equation models. References: (115).

Step 8: Tests of Reliability: Establishing if Responses Are Consistent When Repeated
- Calculate reliability statistics. Purpose: to assess the internal consistency of the scale, i.e., the degree to which the set of items in the scale co-vary, relative to their sum score. How: 8.1 Estimate using Cronbach's alpha; 8.2 Other tests such as Raykov's rho, ordinal alpha, and Revelle's beta can be used to assess scale reliability. References: (116–123).
- Test–retest reliability. Purpose: to assess the degree to which the participant's performance is repeatable, i.e., how consistent their scores are across time. How: 8.3 Estimate the strength of the relationship between scale scores over two or three time points; a variety of measures is possible. References: (1, 2, 124, 125).

Step 9: Tests of Validity: Ensuring You Measure the Latent Dimension You Intended
Criterion validity
- Predictive validity. Purpose: to determine if scores predict future outcomes. How: 9.1 Use bivariate and multivariable regression; stronger and significant associations or causal effects suggest greater predictive validity. References: (1, 2, 31).
- Concurrent validity. Purpose: to determine the extent to which scale scores have a stronger relationship with criterion measurements made near the time of administration. How: 9.2 Estimate the association between scale scores and a "gold standard" of scale measurement; a stronger significant association in the Pearson product-moment correlation suggests support for concurrent validity. References: (2).
Construct validity
- Convergent validity. Purpose: to examine if the same concept measured in different ways yields similar results. How: 9.3 Estimate the relationship between scale scores and similar constructs using a multi-trait multi-method matrix, latent variable modeling, or the Pearson product-moment coefficient; higher/stronger correlation coefficients suggest support for convergent validity. References: (2, 37, 126).
- Discriminant validity. Purpose: to examine if the concept measured is different from some other concept. How: 9.4 Estimate the relationship between scale scores and distinct constructs using a multi-trait multi-method matrix, latent variable modeling, or the Pearson product-moment coefficient; lower/weaker correlation coefficients suggest support for discriminant validity. References: (2, 37, 126).
- Differentiation by "known groups". Purpose: to examine if the concept measured behaves as expected in relation to "known groups". How: 9.5 Select known binary variables based on theoretical and empirical knowledge and determine the distribution of the scale scores over the known groups; use t-tests if binary, ANOVA if multiple groups. References: (2, 126).
- Correlation analysis. Purpose: to determine the relationship between existing measures or variables and newly developed scale scores. How: 9.6 Correlate scale scores and existing measures or, preferably, use linear regression, the intraclass correlation coefficient, and analysis of the standard deviations of the differences between scores. References: (2, 127, 128).

PHASE 1: ITEM DEVELOPMENT Item Generation Once the domain is delineated, the item pool can then be Step 1: Identification of the Domain(s) and identified. This process is also called “question development” Item Generation (26) or “item generation” (24). There are two ways to identify Domain Identification appropriate questions: deductive and inductive methods (24). The first step is to articulate the domain(s) that you are The deductive method, also known as “logical partitioning” endeavoring to measure. A domain or construct refers to the or “classification from above” (27) is based on the description of concept, attribute, or unobserved behavior that is the target of the relevant domain and the identification of items. This can be the study (25).
Therefore, the domain being examined should done through literature review and assessment of existing scales be decided upon and defined before any item activity (2). A and indicators of that domain (2, 24). The inductive method, well-defined domain will provide a working knowledge of the also known as “grouping” or “classification from below” (24, 27) phenomenon under study, specify the boundaries of the domain, involves the generation of items from the responses of individuals and ease the process of item generation and content validation. (24). Qualitative data obtained through direct observations and McCoach et al. outline a number of steps in scale development; exploratory research methodologies, such as focus groups and we find the first five to be suitable for the identification of individual interviews, can be used to inductively identify domain domain (4). These are all based on thorough literature review and items (5). include (a) specifying the purpose of the domain or construct It is considered best practice to combine both deductive and you seek to develop, and (b), confirming that there are no inductive methods to both define the domain and identify the existing instruments that will adequately serve the same purpose. questions to assess it. While the literature review provides the Where there is a similar instrument in existence, you need to theoretical basis for defining the domain, the use of qualitative justify why the development of a new instrument is appropriate techniques moves the domain from an abstract point to the and how it will differ from existing instruments. Then, (c) identification of its manifest forms. A scale or construct defined describe the domain and provide a preliminary conceptual by theoretical underpinnings is better placed to make specific definition and (d) specify, if any, the dimensions of the domain. pragmatic decisions about the domain (28), as the construct will Alternatively, you can let the number of dimensions forming be based on accumulated knowledge of existing items. the domain to be determined through statistical computation It is recommended that the items identified using deductive (cf. Steps 5, 6, and 7). Domains are determined a priori if there and inductive approaches should be broader and more is an established framework or theory guiding the study, but a comprehensive than one’s own theoretical view of the target posteriori if none exist. Finally, if domains are identified a priori, (28, 29). Further, content should be included that ultimately will (e) the final conceptual definition for each domain should be be shown to be tangential or unrelated to the core construct. specified. In other words, one should not hesitate to have items on the Frontiers in Public Health | www.frontiersin.org 5 June 2018 | Volume 6 | Article 149 Boateng et al. Scale Development and Validation scale that do not perfectly fit the domain identified, as successive than five categories are best estimated using robust categorical evaluation will eliminate undesirable items from the initial pool. methods. However, items with five to seven categories without Kline and Schinka et al. note that the initial pool of items strong floor or ceiling effects can be treated as continuous items developed should be at minimum twice as long as the desired final in confirmatory factor analysis and structural equation modeling scale (26, 30). Others have recommended the initial pool to be five using maximum likelihood estimations (34). 
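The guidance above is that items with five to seven response categories and no strong floor or ceiling effects may be treated as continuous in confirmatory factor analysis, whereas heavily skewed items call for robust categorical estimation. As a minimal illustrative sketch (not from the paper, which points readers to packages such as Mplus, R, SAS, SPSS, or Stata), the Python/pandas fragment below tabulates floor and ceiling percentages for hypothetical Likert items; the function name, example data, and the 15–20% flagging rule of thumb in the comment are assumptions, not recommendations from the text.

```python
import pandas as pd

def floor_ceiling_report(items: pd.DataFrame, floor_code=1, ceiling_code=5) -> pd.DataFrame:
    """Percentage of respondents at the lowest and highest response category per item.

    Large percentages (often read as > 15-20%, a common rule of thumb and an
    assumption here) suggest floor/ceiling effects, arguing for robust
    categorical estimation rather than treating the item as continuous.
    """
    return pd.DataFrame({
        "pct_floor": (items == floor_code).mean() * 100,
        "pct_ceiling": (items == ceiling_code).mean() * 100,
        "n_categories": items.nunique(),
    }).round(1)

# Hypothetical item responses coded 1-5
items = pd.DataFrame({
    "item1": [1, 2, 3, 4, 5, 3, 2, 4],
    "item2": [5, 5, 4, 5, 5, 4, 5, 5],   # likely ceiling effect
})
print(floor_ceiling_report(items))
```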
times as large as the final version, to provide the requisite margin One pitfall in the identification of domain and item to select an optimum combination of items (30). We agree with generation is the improper conceptualization and definition of Kline and Schinka et al. (26, 30) that the number of items should the domain(s). This can result in scales that may either be be at least twice as long as the desired scale. deficient because the definition of the domain is ambiguous Further, in the development of items, the form of the items, or has been inadequately defined (35). It can also result in the wording of the items, and the types of responses that the contamination, i.e., the definition of the domain overlaps with question is designed to induce should be taken into account. other existing constructs in the same field (35). It also means questions should capture the lived experiences Caution should also be taken to avoid construct of the phenomenon by target population (30). Further, items underrepresentation, which is when a scale does not capture should be worded simply and unambiguously. Items should not important aspects of a construct because its focus is too narrow be offensive or potentially biased in terms of social identity, (35, 36). Further, construct-irrelevant variance, which is the i.e., gender, religion, ethnicity, race, economic status, or sexual degree to which test scores are influenced by processes that have orientation (30). little to do with the intended construct and seem to be widely Fowler identified five essential characteristics of items inclusive of non-related items (36, 37), should be avoided. Both required to ensure the quality of construct measurement (31). construct underrepresentation and irrelevant variance can lead These include (a) the need for items to be consistently to the invalidation of the scale (36). understood; (b) the need for items to be consistently An example of best practice using the deductive approach to administered or communicated to respondents; (c) the consistent item generation is found in the work of Dennis on breastfeeding communication of what constitutes an adequate answer; (d) self-efficacy (38–40). Dennis’ breastfeeding self-efficacy scale the need for all respondents to have access to the information items were first informed by Bandura’s theory on self-efficacy, needed to answer the question accurately; and (e) the willingness followed by content analysis of literature review, and empirical for respondents to provide the correct answers required by the studies on breastfeeding-related confidence. question at all times. A valuable example for a rigorous inductive approach is found These essentials are sometimes very difficult to achieve. in the work of Frongillo and Nanama on the development and Krosnick (32) suggests that respondents can be less thoughtful validation of an experience-based measure of household food about the meaning of a question, search their memories less insecurity in northern Burkina Faso (41). In order to generate comprehensively, integrate retrieved information less carefully, items for the measure, they undertook in-depth interviews with or even select a less precise response choice. All this means 10 household heads and 26 women using interview guides. The that they are merely satisficing, i.e., providing merely satisfactory data from these interviews were thematically analyzed, with the answers, rather than the most accurate ones. 
In order to combat results informing the identification of items to be added or this behavior, questions should be kept simple, straightforward, deleted from the initial questionnaire. Also, the interviews led to and should follow the conventions of normal conversation. the development and revision of answer choices. With regards to the type of responses to these questions, we recommend that questions with dichotomous response Step 2: Content Validity categories (e.g., true/false) should have no ambiguity. When a Content validity, also known as “theoretical analysis” (5), refers Likert-type response scale is used, the points on the scale should to the “adequacy with which a measure assesses the domain reflect the entire measurement continuum. Responses should of interest” (24). The need for content adequacy is vital if the be presented in an ordinal manner, i.e., in an ascending order items are to measure what they are presumed to measure (1). without any overlap, and each point on the response scale should Additionally, content validity specifies content relevance and be meaningful and interpreted the same way by each participant content representations, i.e., that the items capture the relevant to ensure data quality (33). experience of the target population being examined (129). In terms of the number of points on the response scale, Content validity entails the process of ensuring that only Krosnick and Presser (33) showed that responses with just two the phenomenon spelled out in the conceptual definition, but to three points have lower reliability than Likert-type response not other aspects that “might be related but are outside the scales with five to seven points. However, the gain levels off investigator’s intent for that particular [construct] are added” after seven points. Therefore, response scales with five points are (1). Guion has proposed five conditions that must be satisfied recommended for unipolar items, i.e., those reflecting relative in order for one to claim any form of content validity. We find degrees of a single item response quality, e.g., not at all satisfied to these conditions to be broadly applicable to scale development very satisfied. Seven response items are recommended for bipolar in any discipline. These include that (a) the behavioral content items, i.e., those reflecting relative degrees of two qualities of an has a generally accepted meaning or definition; (b) the domain item response scale, e.g., completely dissatisfied to completely is unambiguously defined; (c) the content domain is relevant to satisfied. As an analytic aside, items with scale points fewer the purposes of measurement; (d) qualified judges agree that the Frontiers in Public Health | www.frontiersin.org 6 June 2018 | Volume 6 | Article 149 Boateng et al. Scale Development and Validation domain has been adequately sampled based on consensus; and An example of the concurrent use of expert and target (e) the response content must be reliably observed and evaluated population judges comes from Boateng et al.’s work to develop (42). Therefore, content validity requires evidence of content a household-level water insecurity scale appropriate for use in relevance, representativeness, and technical quality. western Kenya (20). We used the Delphi method to obtain three Content validity is mainly assessed through evaluation by rounds of feedback from international experts including those expert and target population judges. 
in hydrology, geography, WASH and water-related programs, policy implementation, and food insecurity. Each of the three rounds was interspersed with focus group discussions with our Evaluation by Experts target population, i.e., people living in western Kenya. In each Expert judges are highly knowledgeable about the domain of round, the questionnaires progressively became more closed interest and/or scale development; target population judges are ended, until consensus was attained on the definition of the potential users of the scale (1, 5). Expert judges seem to be used domain we were studying and possible items we could use. more often than target-population judges in scale development work to date. Ideally, one should combine expert and target population judgment. When resources are constrained, however, PHASE 2: SCALE DEVELOPMENT we recommend at least the use of expert judges. Expert judges evaluate each of the items to determine whether Step 3: Pre-testing Questions they represent the domain of interest. These expert judges Pre-testing helps to ensure that items are meaningful to the should be independent of those who developed the item pool. target population before the survey is actually administered, i.e., Expert judgment can be done systematically to avoid bias in the it minimizes misunderstanding and subsequent measurement assessment of items. Multiple judges have been used (typically error. Because pre-testing eliminates poorly worded items and ranging from 5 to 7) (25). Their assessments have been quantified facilitates revision of phrasing to be maximally understood, it also using formalized scaling and statistical procedures such as the serves to reduce the cognitive burden on research participants. content validity ratio for quantifying consensus (43), content Finally, pre-testing represents an additional way in which validity index for measuring proportional agreement (44), or members of the target population can participate in the research Cohen’s coefficient kappa (k) for measuring inter-rater or expert process by contributing their insights to the development of the agreement (45). Among the three procedures, we recommend survey. Cohen’s coefficient kappa, which has been found to be most Pre-testing has two components: the first is the examination efficient (46). Additionally, an increase in the number of experts of the extent to which the questions reflect the domain has been found to increase the robustness of the ratings (25, 44). being studied. The second is the examination of the extent Another way by which content validity can be assessed to which answers to the questions asked produce valid through expert judges is by using the Delphi method to come to a measurements (31). consensus on which questions are a reflection of the construct you want to measure. The Delphi method is a technique “for Cognitive Interviews structuring group communication process so that the process is To evaluate whether the questions reflect the domain of effective in allowing a group of individuals, as a whole, to deal study and meet the requisite standards, techniques including with a complex problem” (47). cognitive interviews, focus group discussion, and field pre-testing A good example of evaluation of content validity using expert under realistic conditions can be used. We describe the most judges is seen in the work of Augustine et al. on adolescent recommended, which is cognitive interviews. knowledge of micronutrients (48). 
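Where expert ratings are quantified, Lawshe's content validity ratio and the item-level content validity index reduce to simple proportions: CVR = (n_e − N/2)/(N/2) for N experts of whom n_e rate an item essential, and the item-level CVI is the share of experts rating the item relevant. The Python sketch below is illustrative only; the ratings, function names, and the 4-point relevance coding are assumptions.

```python
import numpy as np

def content_validity_ratio(n_essential: int, n_experts: int) -> float:
    """Lawshe's content validity ratio: (n_e - N/2) / (N/2)."""
    return (n_essential - n_experts / 2) / (n_experts / 2)

def item_cvi(ratings: np.ndarray, relevant_codes=(3, 4)) -> float:
    """Item-level content validity index: proportion of experts rating the
    item as relevant (here assumed to be 3 or 4 on a 4-point relevance scale)."""
    return float(np.isin(ratings, relevant_codes).mean())

# Hypothetical panel of 6 expert judges rating one candidate item
relevance = np.array([4, 3, 4, 2, 4, 3])
print(round(item_cvi(relevance), 2))                        # 0.83
print(round(content_validity_ratio(n_essential=5, n_experts=6), 2))  # 0.67
```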
After identifying a list of items Cognitive interviewing entails the administration of draft to be validated, the authors consulted experts in the field of survey questions to target populations and then asking the nutrition, psychology, medicine, and basic sciences. The items respondents to verbalize the mental process entailed in providing were then subjected to content analysis using expert judges. Two such answers (49). Generally, cognitive interviews allow for independent reviews were carried out by a panel of five experts questions to be modified, clarified, or augmented to fit the to select the questions that were appropriate, accurate, and objectives of the study. This approach helps to determine whether interpretable. Items were either accepted, rejected, or modified the question is generating the information that the author intends based on majority opinion (48). by helping to ensure that respondents understand questions as developers intended and that respondents are able to answer Evaluation by Target Population in a manner that reflects their experience (49, 50). This can Target population judges are experts at evaluating face validity, be done on a sample outside of the study population or on a which is a component of content validity (25). Face validity is subset of study participants, but it must be explored before the the “degree that respondents or end users [or lay persons] judge questionnaire is finalized (51, 52). that the items of an assessment instrument are appropriate to the The sample used for cognitive interviewing should capture the targeted construct and assessment objectives” (25). These end- range of demographics you anticipate surveying (49). A range users are able to tell whether the construct is a good measure of 5–15 interviews in two to three rounds, or until saturation, of the domain through cognitive interviews, which we discuss in or relatively few new insights emerge is considered ideal for Step 3. pre-testing (49, 51, 52). Frontiers in Public Health | www.frontiersin.org 7 June 2018 | Volume 6 | Article 149 Boateng et al. Scale Development and Validation In sum, cognitive interviews get to the heart of both assessing as sexual behaviors and substance use, compared to when being the appropriateness of the question to the target population asked by another human. and the strength of the responses (49). The advantages of On the other hand, paper forms may avert the crisis of losing using cognitive interviewing include: (a) it ensures questions are data if the software crashes, the devices are lost or stolen prior producing the intended data, (b) questions that are confusing to being backed up, and may be more suitable in areas that to participants are identified and improved for clarity, (c) have irregular electricity and/or internet. However, as sample problematic questions or questions that are difficult to answer sizes increase, the use of PAPI becomes more expensive, time are identified, (d) it ensures response options are appropriate and labor intensive, and the data are exposed in several ways to and adequate, (e) it reveals the thought process of participants human error (57, 58). Based on the merits of CAPI over PAPI, we on domain items, and (f) it can indicate problematic question recommend researchers use CAPI in data collection for surveys order (52, 53). Outcomes of cognitive interviews should when feasible. always be reported, along with solutions used to remedy the situation. 
Establishing the Sample Size An example of best practice in pre-testing is seen in the work The sample size to use for the development of a latent construct of Morris et al. (54). They developed and validated a novel scale has often been contentious. It is recommended that potential for measuring interpersonal factors underlying injection drug use scale items be tested on a heterogeneous sample, i.e., a sample that behaviors among injecting partners. After item development and both reflects and captures the range of the target population (29). expert judgment, they conducted cognitive interviews with seven For example, when the scale is used in a clinical setting, Clark and respondents with similar characteristics to the target population Watson recommend using patient samples early on instead of a to refine and assess item interpretation and to finalize item sample from the general population (29). structure. Eight items were dropped after cognitive interviews for The necessary sample size is dependent on several aspects lack of clarity or importance. They also made modifications to of any given study, including the level of variation between the grammar, word choice, and answer options based on the feedback variables, and the level of over-determination (i.e., the ratio of from cognitive interviews. variables to number of factors) of the factors (59). The rule of thumb has been at least 10 participants for each scale item, i.e., an ideal ratio of respondents to items is 10:1 (60). However, others Step 4: Survey Administration and Sample have suggested sample sizes that are independent of the number Size of survey items. Clark and Watson (29) propose using 300 Survey Administration respondents after initial pre-testing. Others have recommended Collecting data with minimum measurement errors from a range of 200–300 as appropriate for factor analysis (61, 62). an adequate sample size is imperative. These data can be Based on their simulation study using different sample sizes, collected using paper and pen/pencil interviewing (PAPI) or Guadagnoli and Velicer (61) suggested that a minimum of Computer Assisted Personal Interviewing (CAPI) on devices 300–450 is required to observe an acceptable comparability of like laptops, tablets, or phones. A number of software patterns, and that replication is required if the sample size is programs exist for building forms on devices. These include <300. Comrey and Lee suggest a graded scale of sample sizes for Computer Assisted Survey Information Collection (CASIC) scale development: 100 = poor, 200 = fair, 300 = good, 500 = TM Builder (West Portal Software Corporation, San Francisco, very good, ≥1,000 = excellent (63). Additionally, item reduction TM CA); Qualtrics Research Core (www.qualtrics.com); Open procedures (described, below in Step 5), such as parallel analysis Data Kit (ODK, https://opendatakit.org/); Research Electronic which requires bootstrapping (estimating statistical parameters Data Capture (REDCap) (55); SurveyCTO (Dobility, Inc. from sample by means of resampling with replacement) (64), https://www.surveycto.com); and Questionnaire Development may require larger data sets. TM System (QDS, www.novaresearch.com), which allows the In sum, there is no single item-ratio that works for all survey participant to report sensitive audio data. development scenarios. A larger sample size or respondent: Each approach has advantages and drawbacks. 
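The 10:1 respondent-to-item ratio and Comrey and Lee's grading of absolute sample sizes discussed above can be turned into a quick planning check. The helper below is a hypothetical convenience function, not a method from the paper; only the thresholds it encodes come from the text.

```python
def sample_size_check(n_items: int, n_respondents: int) -> dict:
    """Screen a planned sample against two rules of thumb discussed above:
    a 10:1 respondent-to-item ratio and Comrey and Lee's grading of
    absolute sample sizes (100 poor ... >= 1,000 excellent)."""
    grades = [(1000, "excellent"), (500, "very good"), (300, "good"),
              (200, "fair"), (100, "poor")]
    grade = next((label for cutoff, label in grades if n_respondents >= cutoff),
                 "below guideline minimums")
    return {
        "respondents_per_item": round(n_respondents / n_items, 1),
        "meets_10_to_1": n_respondents >= 10 * n_items,
        "comrey_lee_grade": grade,
    }

print(sample_size_check(n_items=30, n_respondents=350))
# {'respondents_per_item': 11.7, 'meets_10_to_1': True, 'comrey_lee_grade': 'good'}
```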
Using item ratio is always better, since a larger sample size implies technology can reduce the errors associated with data entry, lower measurement errors and more, stable factor loadings, allow the collection of data from large samples with minimal replicable factors, and generalizable results to the true population cost, increase response rate, reduce enumerator errors, permit structure (59, 65). A smaller sample size or respondent: item ratio instant feedback, and increase monitoring of data collection may mean more unstable loadings and factors, random, non- and ability to get more confidential data (56–58, 130). A subset replicable factors, and non-generalizable results (59, 65). Sample of technology-based programs offers the option of attaching size is, however, always constrained by resources available, and audio files to the survey questions so that questions may be more often than not, scale development can be difficult to fund. recorded and read out loud to participants with low literacy via audio computer self-assisted interviewing (A-CASI) (131). Determining the Type of Data to Use Self-interviewing, whether via A-CASI or via computer-assisted The development of a scale minimally requires data from a single personal interviewing, in which participants read and respond point in time. To fully test for the reliability of the scale (cf. Steps to questions on a computer without interviewer involvement, 8, 9), however, either an independent dataset or a subsequent may increase reports of sensitive or stigmatized behaviors such time point is necessary. Data from longitudinal studies can be Frontiers in Public Health | www.frontiersin.org 8 June 2018 | Volume 6 | Article 149 Boateng et al. Scale Development and Validation used for initial scale development (e.g., from baseline), to conduct inter-item and item-total correlations, which are mostly used for confirmatory factor analysis (using follow-up data, cf. Step 7), categorical items; and distractor efficiency analysis for items with and to assess test–retest reliability (using baseline and follow- multiple choice response options (1, 2). up data). The problem with using longitudinal data to test Item Difficulty Index hypothesized latent structures is common error variance, since the same, potentially idiosyncratic, participants will be involved. The item difficulty index is both a CTT and an IRT parameter that To give the most credence to the reliability of scale, the ideal can be traced largely to educational and psychological testing to procedure is to develop the scale on sample A, whether cross- assess the relative difficulties and discrimination abilities of test sectional or longitudinal, and then test it on an independent items (66). Subsequently, this approach has been applied to more sample B. attitudinal-type scales designed to measure latent constructs. The work of Chesney et al. on the Coping Self-Efficacy Under the CTT framework, the item difficulty index, also scale provides an example of this best practice in the use of called item easiness, is the proportion of correct answers on a independent samples (132). This study sought to investigate the given item, e.g., the proportion of correct answers on a math psychometric characteristics of the Coping Self-Efficacy (CSE) test (1, 2). It ranges between 0.0 and 1.0. A high difficulty score scale, and their samples came from two independent randomized means a greater proportion of the sample answered the question clinical trials. 
As such, two independent samples with four correctly. A lower difficulty score means a smaller proportion of different time points each (0, 3, 6, and 12 months) were used. the sample understood the question and answered correctly. This The authors administered the 26-item scale to the sample from may be due to the item being coded wrongly, ambiguity with the the first clinical trial and examined the covariance that existed item, confusing language, or ambiguity with response options. A between all the scale items (exploratory factor analysis) giving the lower difficulty score suggests a need to modify the items or delete hypothesized factor structure across time in that one trial. The them from the pool of items. obtained factor structure was then fitted to baseline data from the Under the IRT framework, the item difficulty parameter is second randomized clinical trial to test the hypothesized factor the probability of a particular examinee correctly answering any structure generated in the first sample (132). given item (67). This has the advantage of allowing the researcher to identify the different levels of individual performance on Step 5: Item Reduction Analysis specific questions, as well as develop particular questions In scale development, item reduction analysis is conducted to specific subgroups or populations (67). Item difficulty is to ensure that only parsimonious, functional, and internally estimated directly using logistic models instead of proportions. consistent items are ultimately included (133). Therefore, the goal Researchers must determine whether they need items with of this phase is to identify items that are not or are the least- low, medium, or high difficulty. For instance, researchers related to the domain under study for deletion or modification. interested in general purpose scales will focus on items with Two theories, Classical Test Theory (CTT) and the Item medium difficulty (68), i.e., the proportion with item assertions Response Theory (IRT), underpin scale development (134). CTT ranging from 0.4 to 0.6 (2, 68). The item difficulty index can be is considered the traditional test theory and IRT the modern test calculated using existing commands in Mplus, R, SAS, SPSS, or theory; both function to produce latent constructs. Each theory Stata. may be used singly or in conjunction to complement the other’s strengths (15, 135). Whether the researcher is using CTT or IRT, Item Discrimination Index the primary goal is to obtain functional items (i.e., items that are The item discrimination index (also called item-effectiveness correlated with each other, discriminate between individual cases, test), is the degree to which an item correctly differentiates underscore a single or multidimensional domain, and contribute between respondents or examinees on a construct of interest (69), significantly to the construct). and can be assessed under both CTT and IRT frameworks. It CTT allows the prediction of outcomes of constructs and the is a measure of the difference in performance between groups difficulty of items (136). CTT models assume that items forming on a construct. The upper group represents participants with constructs in their observed, manifest forms consist of a true high scores and the lower group those with poor or low scores. 
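Under CTT, the item difficulty (easiness) index described here is simply the proportion of respondents answering an item correctly, with roughly 0.4–0.6 treated as medium difficulty. A minimal pandas sketch on hypothetical binary-scored items follows; the data and function name are assumptions, and IRT difficulty parameters would instead be estimated with logistic models in dedicated software, as the text notes.

```python
import pandas as pd

def item_difficulty(scored: pd.DataFrame) -> pd.Series:
    """CTT item difficulty (easiness) index: proportion of correct (=1)
    responses per item; values of roughly 0.4-0.6 indicate medium difficulty."""
    return scored.mean()

# Hypothetical binary-scored responses (1 = correct/affirmed, 0 = not)
scored = pd.DataFrame({
    "item1": [1, 1, 0, 1, 0, 1, 1, 0],
    "item2": [1, 1, 1, 1, 1, 1, 1, 0],   # easy item (difficulty ~0.88)
})
print(item_difficulty(scored))
```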
score on the domain of interest and a random error (which is The item discrimination index is “calculated by subtracting the the differences between the true score and a set of observed proportion of examinees in the lower group (lower %) from the scores by an individual) (137). IRT seeks to model the way in proportion of examinees in the upper group (upper %) who got which constructs manifest themselves in terms of observable the item correct or endorsed the item in the expected manner” item response (138). Comparatively, the IRT approach to scale (69). It differentiates between the number of students in an upper development has the advantage of allowing the researcher to group who get an item correct and the number of students in determine the effect of adding or deleting a given item or set a lower group who get the item correct (70). The use of an of items by examining the item information and standard error item discrimination index enables the identification of positively functions for the item pool (138). discriminating items (i.e., items that differentiate rightly between Several techniques exist within the two theories to reduce those who are knowledgeable about a subject and those who are the item pool, depending on which test theory is driving the not), negatively discriminating items (i.e., items which are poorly scale. The five major techniques used are: item difficulty and item designed such that the more knowledgeable get them wrong and discrimination indices, which are primarily for binary responses; the less knowledgeable get them right), and non-discriminating Frontiers in Public Health | www.frontiersin.org 9 June 2018 | Volume 6 | Article 149 Boateng et al. Scale Development and Validation item (i.e., items that fail to differentiate between participants who the tentative scale. Inter-item and item total correlations can be are knowledgeable about a subject and those who are not) (70). calculated using Mplus, R, SAS, SPSS, or Stata. The item discrimination index has been found to improve Distractor Efficiency Analysis test items in at least three ways. First, non-discriminating items, The distractor efficiency analysis shows the distribution of which fail to discriminate between respondents because they incorrect options and how they contribute to the quality of a may be too easy, too hard, or ambiguous, should be removed (71). Second, items which negatively discriminate, e.g., items multiple-choice item (77). The incorrect options, also known as distractors, are intentionally added in the response options to which fail to differentiate rightly between medically diagnosed depressed and non-depressed respondents on a happiness scale, attract students who do not know the correct answer in a test question (78). To calculate this, respondents will be grouped should be reexamined and modified (70, 71). Third, items that into three groups—high, middle, and lower tertiles based on positively discriminate should be retained, e.g., items that are their total scores on a set of items. Items will be regarded as correctly affirmed by a greater proportion of respondents who appropriate if 100% of those in the high group choose the are medically free of depression, with very low affirmation correct response options, about 50% of those in the middle by respondents diagnosed to be medically depressed (71). In choose the correct option, and few or none in the lower group some cases, it has been recommended that such positively choose the correct option (78). 
This type of analysis is rarely discriminating items be considered for revision (70) as the used in the health sciences, as most multiple-choice items are differences could be due to the level of difficulty of the item. An item discrimination index can be calculated through on a Likert-type response scale and do not test respondent correct knowledge, but their experience or perception. However, correlational analysis between the performance on an item and an overall criterion (69) using either the point biserial correlation distractor analysis can help to determine whether items are well-constructed, meaningful, and functional when researchers coefficient or the phi coefficient (72). add response options to questions that do not fit a particular Item discrimination under the IRT framework is a slope experience. It is expected that participants who are determined parameter that determines how steeply the probability of a as having poor knowledge or experience on the construct will correct response changes as the proficiency or trait increases choose the distractors, while those with the right knowledge (73). This allows differentiation between individuals with similar and experience will choose the correct response options (77, 79). abilities and can also be estimated using a logistic model. Under Where those with the right knowledge and experience are not certain conditions, the biserial correlation coefficient under the able to differentiate between distractors and the right response, CTT framework has proven to be identical to the IRT item the question may have to be modified. Non-functional distractors discrimination parameter (67, 74, 75); thus, as the trait increases identified need to be removed and replaced with efficient so does the probability of endorsing an item. These parameters can be computed using existing commands in Mplus, R, SAS, distractors (80). SPSS, or Stata. In both CTT and IRT, higher values are indicators Missing Cases of greater discrimination (73). In addition to these techniques, some researchers opt to delete items with large numbers of cases that are missing, when other missing data-handling techniques cannot be used (81). For cases Inter-item and Item-Total Correlations where modern missing data handling can be used, however, A third technique to support the deletion or modification of items several techniques exist to solve the problem of missing cases. is the estimation of inter-item and item-total correlations, which Two of the approaches have proven to be very useful for scale falls under CTT. These correlations often displayed in the form development: full information maximum likelihood (FIML) (82) of a matrix are used to examine relationships that exist between and multiple imputation (83). Both methods can be applied using individual items in a pool. existing commands in statistical packages such as Mplus, R, SAS, Inter-item correlations (also known as polychoric correlations and Stata. When using multiple imputation to recover missing for categorical variables and tetrachoric correlations for binary data in the context of survey research, the researcher can impute items) examines the extent to which scores on one item are individual items prior to computing scale scores or impute the related to scores on all other items in a scale (2, 68, 76). Also, scale scores from other scale scores (84). 
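Adjusted (corrected) item-total correlations, and the <0.30 flag discussed above, can be computed with a few lines of pandas. Pearson correlations are shown for simplicity; the polychoric and tetrachoric versions mentioned in the text require specialized routines. The data and function name below are hypothetical.

```python
import pandas as pd

def item_total_report(items: pd.DataFrame) -> pd.DataFrame:
    """Adjusted (corrected) item-total correlations: each item correlated with
    the sum of the remaining items; values below ~0.30 are flagged,
    echoing the cut-off discussed above."""
    total = items.sum(axis=1)
    rows = {}
    for col in items.columns:
        r = items[col].corr(total - items[col])   # item vs. rest-of-scale score
        rows[col] = {"adjusted_item_total_r": round(r, 2), "below_0.30": r < 0.30}
    return pd.DataFrame(rows).T

items = pd.DataFrame({
    "q1": [3, 4, 2, 5, 4, 3, 2, 4],
    "q2": [2, 4, 1, 5, 4, 3, 2, 5],
    "q3": [5, 1, 4, 2, 1, 3, 5, 2],   # likely a poorly related item
})
print(items.corr())          # inter-item (Pearson) correlation matrix
print(item_total_report(items))
```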
However, item-level it examines the extent to which items on a scale are assessing the imputation has been shown to produce more efficient estimates same content (76). Items with very low correlations (<0.30) are over scale-level imputation. Thus, imputing individual items less desirable and could be a cue for potential deletion from the before scale development is a preferred approach to imputing tentative scale. newly developed scales for missing cases (84). Item-total correlations (also known as polyserial correlations for categorical variables and biserial correlations for binary items) aim at examining the relationship between each item vs. the total Step 6: Extraction of Factors score of scale items. However, the adjusted item-total correlation, Factor extraction is the phase in which the optimal number of which examines the correlation between the item and the sum factors, sometimes called domains, that fit a set of items are score of the rest of the items excluding itself is preferred (1, 2). determined. This is done using factor analysis. Factor analysis Items with very low adjusted item-total correlations (<0.30) are is a regression model in which observed standardized variables less desirable and could be a cue for potential deletion from are regressed on unobserved (i.e., latent) factors. Because the Frontiers in Public Health | www.frontiersin.org 10 June 2018 | Volume 6 | Article 149 Boateng et al. Scale Development and Validation variables and factors are standardized, the bivariate regression restrictive ICM, in which cross-loadings between items and non- coefficients are also correlations, representing the loading of each target factors are assumed to be exactly zero. The systematic observed variable on each factor. Thus, factor analysis is used to fit assessment procedures are determined by meaningful understand the latent (internal) structure of a set of items, and the satisfactory thresholds; Table 2 contains the most common extent to which the relationships between the items are internally techniques for testing dimensionality. These techniques include consistent (4). This is done by extracting latent factors which the chi-square test of exact fit, Root Mean Square Error of represent the shared variance in responses among the multiple Approximation (RMSEA ≤ 0.06), Tucker Lewis Index (TLI ≥ items (4). The emphasis is on the number of factors, the salience 0.95), Comparative Fit Index (CFI ≥ 0.95), Standardized Root of factor loading estimates, and the relative magnitude of residual Mean Square Residual (SRMR ≤ 0.08), and Weighted Root Mean variances (2). Square Residual (WRMR ≤ 1.0) (90, 92–101). A number of analytical processes have been used to determine the number of factors to retain from a list of items, and it is Bifactor Modeling beyond the scope of this paper to describe all of them. For Bifactor modeling, also referred to as nested factor modeling, is scale development, commonly available methods to determine a form of item response theory used in testing dimensionality the number of factors to retain include a scree plot (85), the of a scale (102, 103). This method can be used when the variance explained by the factor model, and the pattern of factor hypothesized factor structure from the previous model produces loadings (2). 
Where feasible, researchers could also assess the partially overlapping dimensions so that one could be seeing optimal number of factors to be drawn from the list of items using most of the items loading onto one factor and a few items either parallel analysis (86), minimum average partial procedure loading onto a second and/or a third factor. The bifactor (87), or the Hull method (88, 89). model allows researchers to estimate a unidimensional construct The extraction of factors can also be used to reduce items. while recognizing the multidimensionality of the construct (104, With factor analysis, items with factor loadings or slope 105). The bifactor model assumes each item loads onto two coefficients that are below 0.30 are considered inadequate as dimensions, i.e., items forming the construct may be associated they contribute <10% variation of the latent construct measured. with more than one source of true score variance (92). The Hence, it is often recommended to retain items that have factor first is a general latent factor that underlies all the scale items loadings of 0.40 and above (2, 60). Also, items with cross-loadings and the second, a group factor (subscale). A “bifactor model or that appear not to load uniquely on individual factors can be is based on the assumption that a f -factor solution exists for a deleted. For single-factor models in which Rasch IRT modeling is set of n items with one [general]/Global (G) factor and f – 1 used, items are selected as having a good fit based on mean-square Specific (S) factors also called group factors” (92). This approach residual summary statistics (infit and outfit) >0.4 and <1.6 (90). allows researchers to examine any distortion that may occur A number of scales developed stop at this phase and jump when unidimensional IRT models are fit to multidimensional to tests of reliability, but the factors extracted at this point only data (104, 105). To determine whether to retain a construct as provide a hypothetical structure of the scale. The dimensionality unidimensional or multidimensional, the factor loadings from of these factors need to be tested (cf. Step 7) before moving on to the general factor are then compared to those from the group reliability (cf. Step 8) and validity (cf. Step 9) assessment. factors (103, 106). Where the factor loadings on the general factor are significantly larger than the group factors, a unidimensional scale is implied (103, 104). This method is assessed based on meaningful satisfactory thresholds. Alternatively, one can test for PHASE 3: SCALE EVALUATION the coexistence of a general factor that underlies the construct Step 7: Tests of Dimensionality and multiple group factors that explain the remaining variance The test of dimensionality is a test in which the hypothesized not explained by the general factor (92). Each of these methods factors or factor structure extracted from a previous model is can be done using statistical software such as Mplus, R, SAS, SPSS, tested at a different time point in a longitudinal study or, ideally, or Stata. on a new sample (91). Tests of dimensionality determine whether the measurement of items, their factors, and function are the Measurement Invariance same across two independent samples or within the same sample Another method to test dimensionality is measurement at different time points. 
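Parallel analysis, one of the factor-retention methods listed above, compares the eigenvalues observed in the data with those obtained from random data of the same dimensions. The sketch below implements Horn's principal-component form from scratch in numpy and simulates a two-factor data set purely to illustrate; dedicated implementations exist in the statistical packages named in the text, and all names, settings, and simulated data here are illustrative assumptions.

```python
import numpy as np

def parallel_analysis(data: np.ndarray, n_iter: int = 500,
                      quantile: float = 0.95, seed: int = 0) -> int:
    """Horn's parallel analysis (principal-component form): retain leading
    factors whose observed eigenvalues exceed the chosen quantile of
    eigenvalues from random data of the same dimensions."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    obs_eig = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    rand_eig = np.empty((n_iter, p))
    for i in range(n_iter):
        r = rng.standard_normal((n, p))
        rand_eig[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(r, rowvar=False)))[::-1]
    threshold = np.quantile(rand_eig, quantile, axis=0)
    n_keep = 0
    for obs, thr in zip(obs_eig, threshold):
        if obs > thr:
            n_keep += 1
        else:
            break
    return n_keep

# Hypothetical data: 300 respondents, 12 items generated from two latent factors
rng = np.random.default_rng(1)
latent = rng.standard_normal((300, 2))
loadings = rng.uniform(0.5, 0.9, size=(2, 12))
data = latent @ loadings + rng.standard_normal((300, 12))
print("Factors to retain:", parallel_analysis(data))
```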
Such tests can be conducted using invariance, also referred to as factorial invariance or independent cluster model (ICM)-confirmatory factor analysis, measurement equivalence (107). Measurement invariance bifactor modeling, or measurement invariance. concerns the extent to which the psychometric properties of the observed indicators are transportable (generalizable) Confirmatory Factor Analysis across groups or over time (108). These properties include the Confirmatory factor analysis is a form of psychometric hypothesized factor structure, regression slopes, intercept, and assessment that allows for the systematic comparison of an residual variances. Measurement invariance is tested sequentially alternative a priori factor structure based on systematic fit at five levels—configural, metric, scalar, strict (residual), assessment procedures and estimates the relationship between and structural (107, 109). Of key significance to the test of latent constructs, which have been corrected for measurement dimensionality is configural invariance, which is concerned with errors (92). Morin et al. (92) note that it relies on a highly whether the hypothesized factor structure is the same across Frontiers in Public Health | www.frontiersin.org 11 June 2018 | Volume 6 | Article 149 Boateng et al. Scale Development and Validation TABLE 2 | Description of model fit indices and thresholds for evaluating scales developed for health, social, and behavioral research. Model fit indices Description Recommended threshold to use References Chi-square test The chi-square value is a test statistic of the goodness of Chi-square test of model fit has been assessed to be (2, 93) fit of a factor model. It compares the observed overly sensitive to sample size and to vary when dealing covariance matrix with a theoretically proposed with non-normal variables. Hence, the use of non-normal covariance matrix data, a small sample size (n =180–300), and highly correlated items make the chi-square approximation inaccurate. An alternative to this is to use the Satorra-Bentler scaled (mean-adjusted) difference chi-squared statistic. The DIFFTEST has been recommended for models with binary and ordinal variables Root Mean Squared RMSEA is a measure of the estimated discrepancy Browne and Cudeck recommend RMSEA ≤ 0.05 as (26, 96–100) Error of Approximation between the population and model-implied population indicative of close fit, 0.05 ≤ RMSEA ≤ 0.08 as (RMSEA) covariance matrices per degree of freedom (139). indicative of fair fit, and values >0.10 as indicative of poor fit between the hypothesized model and the observed data. However, Hu and Bentler have suggested RMSEA ≤ 0.06 may indicate a good fit Tucker Lewis Index TLI is based on the idea of comparing the proposed Bentler and Bonnett suggest that models with overall fit (95–98) (TLI) factor model to a model in which no interrelationships at indices of <0.90 are generally inadequate and can be all are assumed among any of the items improved substantially. 
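The fit statistics themselves come from SEM software such as Mplus or R, but once reported they can be screened mechanically against the thresholds summarized in Table 2 (RMSEA ≤ 0.06, CFI and TLI ≥ 0.95, SRMR ≤ 0.08, WRMR < 1.0). The helper below is purely illustrative; the function and the example values are assumptions, and only the cut-offs it encodes come from the table.

```python
def check_fit(indices: dict) -> dict:
    """Screen confirmatory factor analysis fit statistics against the
    Table 2 thresholds; the numeric inputs would come from SEM software."""
    rules = {
        "rmsea": lambda v: v <= 0.06,
        "cfi":   lambda v: v >= 0.95,
        "tli":   lambda v: v >= 0.95,
        "srmr":  lambda v: v <= 0.08,
        "wrmr":  lambda v: v < 1.0,
    }
    return {k: ("ok" if rules[k](v) else "check model")
            for k, v in indices.items() if k in rules}

# Hypothetical output from a one-factor CFA
print(check_fit({"rmsea": 0.048, "cfi": 0.97, "tli": 0.96, "srmr": 0.05, "wrmr": 0.85}))
```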
Hu and Bentler recommend TLI ≥ 0.95 Comparative Fit Index CFI is an incremental relative fit index that measures the CFI ≥ 0.95 is often considered an acceptable fit (95–98) (CFI) relative improvement in the fit of a researcher’s model over that of a baseline model Standardized Root SRMR is a measure of the mean absolute correlation Threshold for acceptable model fit is SRMR ≤ 0.08 (95–98) Mean Square Residual residual, the overall difference between the observed and (SRMR) predicted correlations Weighted Root Mean WRMR uses a “variance-weighted approach especially Yu recommends a threshold of WRMR <1.0 for (101) Square Residual suited for models whose variables measured on different assessing model fit. This index is used for confirmatory (WRMR) scales or have widely unequal variances” (139); it has factor analysis and structural equation models with been assessed to be most suitable in assessing models binary and ordinal variables fitted to binary and ordinal data Standard of Reliability A reliability of 0.90 is the minimum recommended Nunnally recommends a threshold of ≥0.90 for (117, 123) for scales threshold that should be tolerated while a reliability of assessing internal consistency for scales 0.95 should be the desirable standard. While the ideal has rarely been attained by most researchers, a reliability coefficient of 0.70 has often been accepted as satisfactory for most scales groups. This assumption has to be met in order for subsequent indices and the strength of factor loadings (cf. Table 2) are tests to be meaningful (107, 109). For example, a hypothesized the basis on which the latent structure of the items can be unidimensional structure, when tested across multiple countries, judged. should be the same. This can be tested in CTT, using multigroup One commonly encountered pitfall is a lack of satisfactory confirmatory factor analysis (110–112). global model fit in confirmatory factor analysis conducted on An alternative approach to measurement invariance in the a new sample following a satisfactory initial factor analysis testing of unidimensionality under item response theory is the performed on a previous sample. Lack of satisfactory fit offers Rasch measurement model for binary items and polytomous IRT the opportunity to identify additional underperforming items models for categorical items. Here, emphasis is on testing the for removal. Items with very poor loadings (≤0.3) can be differential item functioning (DIF)—an indicator of whether “a considered for removal. Also, modification indices, produced group of respondents is scoring better than another group of by Mplus and other structural equation modeling (SEM) respondents on an item or a test after adjusting for the overall programs, can help identify items that need to be modified. ability scores of the respondents” (108, 113). This is analogous Sometimes a higher-order factor structure, where correlations to the conditions underpinning measurement invariance in a among the original factors can be explained by one or more multi-group CFA (108, 113). higher-order factors, is needed. This can also be assessed Whether the hypothesized structure is bidimensional or using statistical software such as Mplus, R, SAS, SPSS, or multidimensional, each dimension in the structure needs to be Stata. tested again to confirm its unidimensionality. This can also be A good example of best practice is seen in the work of done using confirmatory factor analysis. Appropriate model fit Pushpanathan et al. 
A good example of best practice is seen in the work of Pushpanathan et al. (114) on the appropriateness of using a traditional confirmatory factor analysis or a bifactor model to assess whether the Parkinson's Disease Sleep Scale-Revised was better used as a unidimensional scale, a tri-dimensional scale, or a scale with an underlying general factor and three group factors (sub-scales). They tested this using three different models: a unidimensional model (1-factor CFA); a 3-factor model (3-factor CFA) consisting of sub-scales measuring insomnia, motor symptoms and obstructive sleep apnea, and REM sleep behavior disorder; and a confirmatory bifactor model having a general factor and the same three sub-scales combined. The results of this study suggested that only the bifactor model with a general factor and the three sub-scales combined achieved satisfactory model fit. Based on these results, the authors cautioned against the use of a unidimensional total scale score as a cardinal indicator of sleep in Parkinson's disease, and encouraged the examination of its multidimensional subscales (114).

Scoring Scale Items
Finalized items from the tests of dimensionality can be used to create scale scores for substantive analysis, including tests of reliability and validity. Scale scores can be calculated using unweighted or weighted procedures. The unweighted approach involves summing standardized item scores or raw item scores, or computing the mean of raw item scores (115). Weighted scale scores can be produced via statistical software programs such as Mplus, R, SAS, SPSS, or Stata. For instance, in confirmatory factor analysis, structural equation models, or exploratory factor analysis, each factor produced reveals a statistically independent source of variation among a set of items (115). The contribution of each individual item to this factor is considered a weight, with the factor loading value representing the weight. The scores associated with each factor in a model then represent a composite scale score based on a weighted sum of the individual items using their factor loadings (115). In general, it makes little difference in the performance of the scale whether scores are computed from unweighted items (e.g., mean or sum scores) or weighted items (e.g., factor scores).
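As a brief illustration of the unweighted and weighted scoring approaches, the following R sketch computes sum, mean, and factor scores; `items` and the fitted lavaan model `fit` are hypothetical objects carried over from the previous sketch and are not part of the original material.

```r
# Illustrative sketch (R); `items` and `fit` are hypothetical objects.
library(lavaan)

item_names <- c("item1", "item2", "item3", "item4", "item5", "item6")

# Unweighted scores: sum or mean of the raw items
sum_score  <- rowSums(items[, item_names])
mean_score <- rowMeans(items[, item_names])

# Weighted scores: factor scores, in which each item's contribution is
# weighted by its loading on the latent factor
factor_score <- as.numeric(lavPredict(fit))

# The two approaches typically rank respondents very similarly
cor(sum_score, factor_score)
```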
Step 8: Tests of Reliability
Reliability is the degree of consistency exhibited when a measurement is repeated under identical conditions (116). A number of standard statistics have been developed to assess the reliability of a scale, including Cronbach's alpha (117), ordinal alpha (118, 119) specific to binary and ordinal scale items, test–retest reliability (coefficient of stability) (1, 2), McDonald's omega (120), Raykov's rho (2), Revelle's beta (121, 122), split-half estimates, the Spearman-Brown formula, the alternate form method (coefficient of equivalence), and inter-observer reliability (1, 2). Of these statistics, Cronbach's alpha and test–retest reliability are predominantly used to assess the reliability of scales (2, 117).

Cronbach's Alpha
Cronbach's alpha assesses the internal consistency of the scale items, i.e., the degree to which the set of items in the scale co-vary, relative to their sum score (1, 2, 117). An alpha coefficient of 0.70 has often been regarded as an acceptable threshold for reliability; however, an alpha between 0.80 and 0.95 is preferred for the psychometric quality of scales (60, 117, 123). Cronbach's alpha has been the most commonly used statistic and seems to have received general approval; however, reliability statistics such as Raykov's rho, ordinal alpha, and Revelle's beta, which are argued to offer improvements over Cronbach's alpha, are beginning to gain acceptance.

Test–Retest Reliability
An additional approach to testing reliability is test–retest reliability. Test–retest reliability, also known as the coefficient of stability, is used to assess the degree to which participants' performance is repeatable, i.e., how consistent their sum scores are across time (2). Researchers vary in how they assess test–retest reliability: some prefer the intraclass correlation coefficient (124), while others use the Pearson product-moment correlation (125). In both cases, the higher the correlation, the higher the test–retest reliability, with values close to zero indicating low reliability. In addition, study conditions could change values on the construct being measured over time (as in an intervention study, for example), which could lower the test–retest reliability.

The work of Johnson et al. (16) on the validation of the HIV Treatment Adherence Self-Efficacy Scale (ASES) is a good example of testing reliability. As part of testing for reliability, the authors estimated internal consistency reliability for the ASES and its subscales using Raykov's rho (which produces a coefficient similar to alpha but with fewer assumptions and with confidence intervals); they then tested the temporal consistency of the ASES' factor structure, followed by test–retest reliability assessment among the latent factors. These different approaches provided support for the reliability of the ASES scale. Other approaches found to be useful in supporting scale reliability include split-half estimates, the Spearman-Brown formula, the alternate form method (coefficient of equivalence), and inter-observer reliability (1, 2).
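For illustration, several of the reliability statistics discussed above can be obtained with the psych package in R, as sketched below; this is one possible workflow rather than the one used by the authors, and `items` (item responses from one administration) and `score_t1`/`score_t2` (total scores from two administrations) are hypothetical objects.

```r
# Illustrative sketch (R, psych); `items`, `score_t1`, and `score_t2` are
# hypothetical objects.
library(psych)

# Cronbach's alpha for internal consistency (with item-dropped statistics)
psych::alpha(items)

# McDonald's omega, an alternative that relaxes some of alpha's assumptions
psych::omega(items)

# Test-retest reliability as a Pearson product-moment correlation...
cor(score_t1, score_t2, use = "complete.obs")

# ...or as an intraclass correlation coefficient over the two administrations
psych::ICC(cbind(score_t1, score_t2))
```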
Step 9: Tests of Validity
Scale validity is the extent to which "an instrument indeed measures the latent dimension or construct it was developed to evaluate" (2). Although it is discussed at length here in Step 9, validation is an ongoing process that starts with the identification and definition of the domain of study (Step 1) and continues to its generalizability with other constructs (Step 9) (36). The validity of an instrument can be examined in numerous ways; the most common tests of validity are content validity (described in Step 2), which can be done before the instrument is administered to the target population, and criterion validity (predictive and concurrent) and construct validity (convergent, discriminant, differentiation by known groups, correlations), which occur after survey administration.

Criterion Validity
Criterion validity is the "degree to which there is a relationship between a given test score and performance on another measure of particular relevance, typically referred to as criterion" (1, 2). There are two forms of criterion validity: predictive (criterion) validity and concurrent (criterion) validity. Predictive validity is "the extent to which a measure predicts the answers to some other question or a result to which it ought to be related with" (31). Thus, the scale should be able to predict a behavior in the future. An example is the ability of an exclusive breastfeeding social support scale to predict exclusive breastfeeding (10). Here, the mother's willingness to exclusively breastfeed occurs after social support has been given, i.e., the scale should predict the behavior. Predictive validity can be estimated by examining the association between the scale scores and the criterion in question.

Concurrent criterion validity is the extent to which test scores have a stronger relationship with a criterion ("gold standard") measurement made at the time of test administration or shortly afterward (2). This can be estimated using the Pearson product-moment correlation or latent variable modeling. The work of La Greca and Stone on the psychometric evaluation of the revised version of a social anxiety scale for children (SASC-R) provides a good example of the evaluation of concurrent validity (140). In this study, the authors collected data on an earlier validated version of the SASC scale consisting of 10 items, as well as the revised version, the SASC-R, which had 16 additional items, making a 26-item scale. The SASC consisted of two subscales [fear of negative evaluation (FNE); social avoidance and distress (SAD)], and the SASC-R produced three new subscales (FNE, SAD-New, and SAD-General). Using Pearson product-moment correlations, the authors examined the inter-correlations between the common subscales for FNE, and between SAD and SAD-New. With validity coefficients of 0.94 and 0.88, respectively, the authors found evidence of concurrent validity.

A limitation of concurrent validity is that this strategy does not work with small sample sizes because of their large sampling errors. Second, appropriate criterion variables or "gold standards" may not be available (2). This may account for its omission in most validation studies.
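As an illustration of how predictive and concurrent criterion validity might be quantified, consider the R sketch below; the data frame `dat`, the scale score `support_score`, the later behavior `ebf_6mo` (e.g., exclusive breastfeeding at follow-up, coded 0/1), and the concurrent criterion `criterion_now` are hypothetical names introduced here for illustration only.

```r
# Illustrative sketch (R); `dat` and its variables are hypothetical.

# Predictive validity: does the scale score predict the future behavior?
predictive <- glm(ebf_6mo ~ support_score, data = dat, family = binomial)
summary(predictive)

# Concurrent validity: association with a criterion ("gold standard")
# measured at, or shortly after, test administration
cor.test(dat$support_score, dat$criterion_now)
```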
Construct Validity
Construct validity is the "extent to which an instrument assesses a construct of concern and is associated with evidence that measures other constructs in that domain and measures specific real-world criteria" (2). Four indicators of construct validity are relevant to scale development: convergent validity, discriminant validity, differentiation by known groups, and correlation analysis.

Convergent validity is the extent to which a construct measured in different ways yields similar results. Specifically, it is the "degree to which scores on a studied instrument are related to measures of other constructs that can be expected on theoretical grounds to be close to the one tapped into by this instrument" (2, 37, 126). This is best estimated through the multi-trait multi-method matrix (2), although in some cases researchers have used either latent variable modeling or the Pearson product-moment correlation based on Fisher's Z transformation. Evidence of convergent validity of a construct is provided by the extent to which the newly developed scale correlates highly with other variables designed to measure the same construct (2, 126). It can be invalidated by correlations that are too low or weak with other tests intended to measure the same construct.

Discriminant validity is the extent to which a measure is novel and not simply a reflection of some other construct (126). Specifically, it is the "degree to which scores on a studied instrument are differentiated from behavioral manifestations of other constructs, which on theoretical grounds can be expected not to be related to the construct underlying the instrument under investigation" (2). This is best estimated through the multi-trait multi-method matrix (2). Discriminant validity is indicated by predictably low or weak correlations between the measure of interest and other measures that are supposedly not measuring the same variable or concept (126). The newly developed construct can be invalidated by correlations that are too high with other tests intended to differ in their measurements (37). This approach is critical in differentiating the newly developed construct from other rival alternatives (36).

Differentiation or comparison between known groups examines the distribution of a newly developed scale score over known binary items (126). This is premised on previous theoretical and empirical knowledge of the performance of the binary groups. An example of best practice is seen in the work of Boateng et al. on the validation of a household water insecurity scale in Kenya. In this study, we compared mean household water insecurity scores between households with and without E. coli present in their drinking water. Consistent with what we knew from the extant literature, we found that households with E. coli present in their drinking water had higher mean water insecurity scores than households without E. coli in their drinking water. This suggested our scale could discriminate between particular known groups.

Although correlational analysis is frequently used by several scholars, bivariate regression analysis is preferred to correlational analysis for quantifying validity (127, 128). Regression analysis between scale scores and an indicator of the domain examined has a number of important advantages over correlational analysis. First, regression analysis quantifies the association in meaningful units, facilitating judgment of validity. Second, regression analysis avoids confounding validity with the underlying variation in the sample, and therefore the results from one sample are more applicable to other samples in which the underlying variation may differ. Third, regression analysis is preferred because the regression model can be used to examine discriminant validity by adding potential alternative measures. In addition to regression analysis, alternative techniques such as the analysis of standard deviations of the differences between scores and the examination of intraclass correlation coefficients (ICC) have been recommended as viable options (128).

Taken together, these methods make it possible to assess the validity of an adapted or a newly developed scale. In addition to predictive validity, existing studies in fields such as the health, social, and behavioral sciences have shown that scale validity is supported if at least two of the different forms of construct validity discussed in this section have been examined. Further information about establishing validity and constructing indicators from scales can be found in Frongillo et al. (141).
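The construct validity checks described above can be sketched in R as follows; the data frame `dat`, the new scale score `insecurity_score`, a theoretically related measure `related_construct`, a theoretically unrelated measure `unrelated_construct`, and the binary known-groups indicator `ecoli_present` are hypothetical placeholders, and the code is only one simple way such checks might be run.

```r
# Illustrative sketch (R); `dat` and its variables are hypothetical.

# Convergent validity: expect a high correlation with a related construct
cor.test(dat$insecurity_score, dat$related_construct)

# Discriminant validity: expect a low or weak correlation with an
# unrelated construct
cor.test(dat$insecurity_score, dat$unrelated_construct)

# Differentiation by known groups: compare mean scores across binary groups
t.test(insecurity_score ~ ecoli_present, data = dat)

# Regression-based alternative preferred above: quantifies the association in
# meaningful units, and rival measures can be added to probe discriminant
# validity
summary(lm(insecurity_score ~ ecoli_present + related_construct, data = dat))
```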
CONCLUSIONS
In sum, we have sought to give an overview of the key steps in scale development and validation (Figure 1) as well as to help the reader understand how one might approach each step (Table 1). We have also given a basic introduction to the conceptual and methodological underpinnings of each step.

Because scale development is so complicated, this should be considered a primer, i.e., a "jumping off point" for anyone interested in scale development. The technical literature and examples of rigorous scale development mentioned throughout will be important for readers to pursue. There are a number of matters not addressed here, including how to interpret scale output, the designation of cut-offs, when indices, rather than scales, are more appropriate, and principles for re-testing scales in new populations. Also, this review leans more toward the classical test theory approach to scale development; a comprehensive review of IRT modeling would be complementary. We hope this review helps to ease readers into the literature, but space precludes consideration of all these topics.

The necessity of the nine steps that we have outlined here (Table 1, Figure 1) will vary from study to study. While studies focusing on developing scales de novo may use all nine steps, others, e.g., those that set out to validate existing scales, may end up using only the last four steps. Resource constraints, including time, money, and participant attention and patience, are very real and must be acknowledged as additional limits to rigorous scale development. We cannot state which steps are the most important; difficult decisions about which steps to approach less rigorously can only be made by each scale developer, based on the purpose of the research, the proposed end-users of the scale, and the resources available. It is our hope, however, that by outlining the general shape of the phases and steps in scale development, researchers will be able to purposively choose the steps that they will include, rather than omitting a step out of lack of knowledge.

Well-designed scales are the foundation of much of our understanding of a range of phenomena, but ensuring that we accurately quantify what we purport to measure is not a simple matter. By making scale development more approachable and transparent, we hope to facilitate the advancement of our understanding of a range of health, social, and behavioral outcomes.

AUTHOR CONTRIBUTIONS
GB and SY developed the first draft of the scale development and validation manuscript. All authors participated in the editing and critical revision of the manuscript and approved the final version of the manuscript for publication.

FUNDING
Funding for this work was obtained by SY through the National Institute of Mental Health—R21 MH108444. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of Mental Health or the National Institutes of Health.

ACKNOWLEDGMENTS
We would like to acknowledge the importance of the works of several scholars of scale development and validation used in developing this primer, particularly Robert DeVellis, Tenko Raykov, George Marcoulides, David Streiner, and Betsy McCoach. We would also like to acknowledge the help of Josh Miller of Northwestern University for assisting with the design of Figure 1 and the development of Table 1, and we thank Zeina Jamuladdine for helpful comments on tests of unidimensionality.

REFERENCES
9. Hirani SAA, Karmaliani R, Christie T, Rafique G. Perceived Breastfeeding Support Assessment Tool (PBSAT): development and 1. DeVellis RF.
Scale Development: Theory and Application. Los Angeles, CA: testing of psychometric properties with Pakistani urban working Sage Publications (2012). mothers. Midwifery (2013) 29:599–607. doi: 10.1016/j.midw.2012. 2. Raykov T, Marcoulides GA. Introduction to Psychometric Theory. New York, 05.003 NY: Routledge, Taylor & Francis Group (2011). 10. Boateng GO, Martin S., Collins S, Natamba BK, Young SL. Measuring 3. Streiner DL, Norman GR, Cairney J. Health Measurement Scales: A Practical exclusive breastfeeding social support: scale development and validation in Guide to Their Development and Use. Oxford University Press (2015). Uganda. Matern Child Nutr. (2018). doi: 10.1111/mcn.12579. [Epub ahead of 4. McCoach DB, Gable RK, Madura, JP. Instrument Development in the Affective print]. Domain. School and Corporate Applications, 3rd Edn. New York, NY: 11. Arbach A, Natamba BK, Achan J, Griffiths JK, Stoltzfus RJ, Mehta S, Springer (2013). et al. Reliability and validity of the center for epidemiologic studies- 5. Morgado FFR, Meireles JFF, Neves CM, Amaral ACS, Ferreira MEC. depression scale in screening for depression among HIV-infected Scale development: ten main limitations and recommendations to and -uninfected pregnant women attending antenatal services in improve future research practices. Psicol Reflex E Crítica (2018) 30:3. northern Uganda: a cross-sectional study. BMC Psychiatry (2014) 14:303. doi: 10.1186/s41155-016-0057-1 doi: 10.1186/s12888-014-0303-y 6. Glanz K, Rimer BK, Viswanath K. Health Behavior: Theory, Research, and 12. Natamba BK, Kilama H, Arbach A, Achan J, Griffiths JK, Young SL. Practice. San Francisco, CA: John Wiley & Sons, Inc (2015). Reliability and validity of an individually focused food insecurity access 7. Ajzen I. From intentions to actions: a theory of planned behavior. In: Action scale for assessing inadequate access to food among pregnant Ugandan Control SSSP Springer Series in Social Psychology Berlin; Heidelberg: Springer, women of mixed HIV status. Public Health Nutr. (2015) 18:2895–905. (1985). p. 11–39. doi: 10.1017/S1368980014001669 8. Bai Y, Peng C-YJ, Fly AD. Validation of a short questionnaire to assess 13. Neilands TB, Chakravarty D, Darbes LA, Beougher SC, Hoff CC. mothers’ perception of workplace breastfeeding support. J Acad Nutr Diet Development and validation of the sexual agreement investment scale. J Sex (2008) 108:1221–5. doi: 10.1016/j.jada.2008.04.018 Res. (2010) 47:24–37. doi: 10.1080/00224490902916017 Frontiers in Public Health | www.frontiersin.org 15 June 2018 | Volume 6 | Article 149 Boateng et al. Scale Development and Validation 14. Neilands TB, Choi K-H. A validation and reduced form of the new and existing techniques. MIS Q. (2011) 35:293. doi: 10.2307/23 female condom attitudes scale. AIDS Educ Prev. (2002) 14:158–71. 044045 doi: 10.1521/aeap.14.2.158.23903 36. Messick S. Validity of psychological assessment: validation of inferences 15. Lippman SA, Neilands TB, Leslie HH, Maman S, MacPhail C, Twine from persons’ responses and performance as scientifica inquiry into R, et al. Development, validation, and performance of a scale to score meaning. Am Psychol. (1995) 50:741–9. doi: 10.1037/0003-066X. measure community mobilization. Soc Sci Med. (2016) 157:127–37. 50.9.741 doi: 10.1016/j.socscimed.2016.04.002 37. Campbell DT, Fiske DW. Convergent and discriminant validity by 16. Johnson MO, Neilands TB, Dilworth SE, Morin SF, Remien RH, Chesney the multitrait-multimethod matrix. Psychol Bull. (1959) 56:81–105. MA. 
The role of self-efficacy in HIV treatment adherence: validation of doi: 10.1037/h0046016 the HIV treatment adherence self-efficacy scale (HIV-ASES). J Behav Med. 38. Dennis C. Theoretical underpinnings of breastfeeding confidence: a self- (2007) 30:359–70. doi: 10.1007/s10865-007-9118-3 efficacy framework. J Hum Lact. (1999) 15:195–201. doi: 10.1177/08903 17. Sexton JB, Helmreich RL, Neilands TB, Rowan K, Vella K, Boyden 3449901500303 J, et al. The Safety Attitudes Questionnaire: psychometric properties, 39. Dennis C-L, Faux S. Development and psychometric testing of the benchmarking data, and emerging research. BMC Health Serv Res. (2006) Breastfeeding Self-Efficacy Scale. Res Nurs Health (1999) 22:399–409. doi: 10. 6:44. doi: 10.1186/1472-6963-6-44 1002/(SICI)1098-240X(199910)22:5<399::AID-NUR6>3.0.CO;2-4 18. Wolfe WS, Frongillo EA. Building household food-security measurement 40. Dennis C-L. The breastfeeding self-efficacy scale: psychometric assessment tools from the ground up. Food Nutr Bull. (2001) 22:5–12. of the short form. J Obstet Gynecol Neonatal Nurs. (2003) 32:734–44. doi: 10.1177/156482650102200102 doi: 10.1177/0884217503258459 19. González W, Jiménez A, Madrigal G, Muñoz LM, Frongillo EA. 41. Frongillo EA, Nanama S. Development and validation of an experience- Development and validation of measure of household food insecurity in based measure of household food insecurity within and across urban costa rica confirms proposed generic questionnaire. J Nutr. (2008) seasons in Northern Burkina Faso. J Nutr. (2006) 136:1409S−19S. 138:587–92. doi: 10.1093/jn/138.3.587 doi: 10.1093/jn/136.5.1409S 20. Boateng GO, Collins SM, Mbullo P, Wekesa P, Onono M, Neilands T, et 42. Guion R. Content validity - the source of my discontent. Appl Psychol Meas. al. A novel household water insecurity scale: procedures and psychometric (1977) 1:1–10. doi: 10.1177/014662167700100103 analysis among postpartum women in western Kenya. PloS ONE. (2018). 43. Lawshe C. A quantitative approach to content validity. Pers Psychol. (1975) doi: 10.1371/journal.pone.0198591 28:563–75. doi: 10.1111/j.1744-6570.1975.tb01393.x 21. Melgar-Quinonez H, Hackett M. Measuring household food 44. Lynn M. Determination and quantification of content validity. Nurs Res. security: the global experience. Rev Nutr. (2008) 21:27s−37s. (1986) 35:382–5. doi: 10.1097/00006199-198611000-00017 doi: 10.1590/S1415-52732008000700004 45. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 22. Melgar-Quiñonez H, Zubieta AC, Valdez E, Whitelaw B, Kaiser L. (1960) 20:37–46. doi: 10.1177/001316446002000104 Validación de un instrumento para vigilar la inseguridad alimentaria en 46. Wynd CA, Schmidt B, Schaefer MA. Two quantitative approaches la Sierra de Manantlán, Jalisco. Salud Pública México (2005) 47:413–22. for estimating content validity. West J Nurs Res. (2003) 25:508–18. doi: 10.1590/S0036-36342005000600005 doi: 10.1177/0193945903252998 23. Hackett M, Melgar-Quinonez H, Uribe MCA. Internal validity of a 47. Linstone HA, Turoff M. (eds). The Delphi Method. Reading, MA: Addison- household food security scale is consistent among diverse populations Wesley (1975). participating in a food supplement program in Colombia. BMC Public Health 48. Augustine LF, Vazir S, Rao SF, Rao MV, Laxmaiah A, Ravinder (2008) 8:175. doi: 10.1186/1471-2458-8-175 P, et al. Psychometric validation of a knowledge questionnaire 24. Hinkin TR. 
A review of scale development practices in the study on micronutrients among adolescents and its relationship to of organizations. J Manag. (1995) 21:967–88. doi: 10.1016/0149- micronutrient status of 15–19-year-old adolescent boys, Hyderabad, 2063(95)90050-0 India. Public Health Nutr. (2012) 15:1182–9. doi: 10.1017/S13689800120 25. Haynes SN, Richard DCS, Kubany ES. Content validity in psychological 00055 assessment: a functional approach to concepts and methods. Pyschol Assess. 49. Beatty PC, Willis GB. Research synthesis: the practice of cognitive (1995) 7:238–47. doi: 10.1037/1040-3590.7.3.238 interviewing. Public Opin Q. (2007) 71:287–311. doi: 10.1093/poq/nfm006 26. Kline P. A Handbook of Psychological Testing. 2nd Edn. London: Routledge; 50. Alaimo K, Olson CM, Frongillo EA. Importance of cognitive testing for Taylor & Francis Group (1993). survey items: an example from food security questionnaires. J Nutr Educ. 27. Hunt SD. Modern Marketing Theory. Cincinnati: South-Western Publishing (1999) 31:269–75. doi: 10.1016/S0022-3182(99)70463-2 (1991). 51. Willis GB. Cognitive Interviewing and Questionnaire Design: A Training 28. Loevinger J. Objective tests as instruments of psychological theory. Psychol Manual. Cognitive Methods Staff Working Paper Series. Hyattsville, MD: Rep. (1957) 3:635–94. doi: 10.2466/pr0.1957.3.3.635 National Center for Health Statistics (1994). 29. Clarke LA, Watson D. Constructing validity: basic issues in 52. Willis GB. Cognitive Interviewing: A Tool for Improving Questionnaire objective scale development. Pyschol Assess. (1995) 7:309–19. Design. Thousand Oaks, CA: Sage Publications (2005). doi: 10.1037/1040-3590.7.3.309 53. Tourangeau R. Cognitive aspects of survey measurement and 30. Schinka JA, Velicer WF, Weiner IR. Handbook of Psychology, Vol. 2, Research mismeasurement. Int J Public Opin Res. (2003) 15:3–7. doi: 10.1093/ Methods in Psychology. Hoboken, NJ: John Wiley & Sons, Inc. (2012). ijpor/15.1.3 31. Fowler FJ. Improving Survey Questions: Design and Evaluation. Thousand 54. Morris MD, Neilands TB, Andrew E, Mahar L, Page KA, Hahn Oaks, CA: Sage Publications (1995). JA. Development and validation of a novel scale for measuring 32. Krosnick JA. Questionnaire design. In: Vannette DL, Krosnick JA, editors. interpersonal factors underlying injection drug using behaviours The Palgrave Handbook of Survey Research. Cham: Palgrave Macmillan among injecting partnerships. Int J Drug Policy (2017) 48:54–62. (2018), pp. 439–55. doi: 10.1016/j.drugpo.2017.05.030 33. Krosnick JA, Presser S. Question and questionnaire design. In: Wright JD, 55. Harris PA, Taylor R, Thielke R, Payne J, Gonzalez N, Conde JG. Research Marsden PV, editors. Handbook of Survey Research. San Diego, CA: Elsevier electronic data capture (REDCap)—a metadata-driven methodology and (2009), pp. 263–314. workflow process for providing translational research informatics support. 34. Rhemtulla M, Brosseau-Liard PÉ, Savalei V. When can categorical variables J Biomed Inform. (2009) 42:377–81. doi: 10.1016/j.jbi.2008.08.010 be treated as continuous? A comparison of robust continuous and categorical 56. GoldsteinM, Benerjee R, Kilic T. Paper v Plastic Part 1: The Survey SEM estimation methods under suboptimal conditions. Psychol Methods Revolution Is in Progress. The World Bank Development Impact. (2012). (2012) 17:354–73. doi: 10.1037/a0029315 Available online at: http://blogs.worldbank.org/impactevaluations/paper- 35. MacKenzie SB, Podsakoff PM, Podsakoff NP. 
Construct measurement v-plastic-part-i-the-survey-revolution-is-in-progress (Accessed November and validation procedures in MIS and behavioral research: integrating 10, 2017). Frontiers in Public Health | www.frontiersin.org 16 June 2018 | Volume 6 | Article 149 Boateng et al. Scale Development and Validation 57. Fanning J, McAuley E. A Comparison of tablet computer and paper-based 83. Kenward MG, Carpenter J. Multiple imputation: current perspectives. Stat questionnaires in healthy aging research. JMIR Res Protoc. (2014) 3:e38. Methods Med Res. (2007) 16:199–218. doi: 10.1177/0962280206075304 doi: 10.2196/resprot.3291 84. Gottschall AC, West SG, Enders CK. A Comparison of item-level and scale- 58. Greenlaw C, Brown-Welty S. A Comparison of web-based and paper-based level multiple imputation for questionnaire batteries. Multivar Behav Res. survey methods: testing assumptions of survey mode and response cost. Eval (2012) 47:1–25. doi: 10.1080/00273171.2012.640589 Rev. (2009) 33:464–80. doi: 10.1177/0193841X09340214 85. Cattell RB. The Scree test for the number of factors. Multivar Behav Res. 59. MacCallum RC, Widaman KF, Zhang S, Hong S. Sample size in factor (1966) 1: 245–76. doi: 10.1207/s15327906mbr0102_10 analysis. Psychol Methods (1999) 4:84–99. doi: 10.1037/1082-989X.4.1.84 86. Horn JL. A rationale and test for the number of factors in factor analysis. 60. Nunnally JC. Pyschometric Theory. New York, NY: McGraw-Hill (1978). Psychometrika (1965) 30:179–85. doi: 10.1007/BF02289447 61. Guadagnoli E, Velicer WF. Relation of sample size to the stability 87. Velicer WF. Determining the number of components from the of component patterns. Am Psychol Assoc. (1988) 103:265–75. matrix of partial correlations. Psychometrika (1976) 41:321–7. doi: 10.1037/0033-2909.103.2.265 doi: 10.1007/BF02293557 62. Comrey AL. Factor-analytic methods of scale development in personality 88. Lorenzo-Seva U, Timmerman ME, Kiers HAL. The hull method for selecting and clinical psychology. Am Psychol Assoc. (1988) 56:754–61. the number of common factors. Multivar Behav Res. (2011) 46:340–64. 63. Comrey AL, Lee H. A First Cours in Factor Analysis. Hillsdale, NJ: Lawrence doi: 10.1080/00273171.2011.564527 Erlbaum Associates, Inc. (1992). 89. Jolijn Hendriks AA, Perugini M, Angleitner A, Ostendorf F, Johnson 64. Ong DC. A Primer to Bootstrapping and an Overview of doBootstrap. JA, De Fruyt F, et al. The five-factor personality inventory: cross- Stanford, CA: Department of Psychology, Stanford University (2014). cultural generalizability across 13 countries. Eur J Pers. (2003) 17:347–73. 65. Osborne JW, Costello AB. Sample size and subject to item ratio in principal doi: 10.1002/per.491 components analysis. Pract Assess Res Eval. (2004) 99:1–15. Available online 90. Bond TG, Fox C. Applying the Rasch Model: Fundamental Measurement in at: http://pareonline.net/htm/v9n11.htm the Human Sciences. Mahwah, NJ: Erlbaum (2013). 66. Ebel R., Frisbie D. Essentials of Educational Measurement. Englewood Cliffs, 91. Brown T. Confirmatory Factor Analysis for Applied Research. New York, NY: NJ: Prentice-Hall (1979). Guildford Press (2014). 67. Hambleton R., Jones R. An NCME instructional module on comparison 92. Morin AJS, Arens AK, Marsh HW. A bifactor exploratory structural equation of classical test theory and item response theory and their applications modeling framework for the identification of distinct sources of construct- to test development. Educ Meas Issues Pract. (1993) 12:38–47. relevant psychometric multidimensionality. 
Struct Equ Model Multidiscip J. doi: 10.1111/j.1745-3992.1993.tb00543.x (2016) 23:116–39. doi: 10.1080/10705511.2014.961800 68. Raykov T. Scale Construction and Development. Lecture Notes. Measurement 93. Cochran WG. The χ test of goodness of fit. Ann Math Stat. (1952) 23:315– and Quantitative Methods. East Lansing, MI: Michigan State University 45. doi: 10.1214/aoms/1177729380 (2015). 94. Brown MW. Confirmatory Factor Analysis for Applied Research. New York, 69. Whiston SC. Principles and Applications of Assessment in Counseling. NY: Guildford Press (2014). Cengage Learning (2008). 95. Tucker LR, Lewis C. A reliability coefficient for maximum likelihood factor 70. Brennan RL. A generalized upper-lower item discrimination index. analysis. Psychometrika (1973) 38:1–10. doi: 10.1007/BF02291170 Educ Psychol Meas. (1972) 32:289–303. doi: 10.1177/0013164472032 96. Bentler PM, Bonett DG. Significance tests and goodness of fit in 00206 the analysis of covariance structures. Psychol Bull. (1980) 88:588–606. 71. Popham WJ, Husek TR. Implications of criterion-referenced measurement. doi: 10.1037/0033-2909.88.3.588 J Educ Meas. (1969) 6:1–9. doi: 10.1111/j.1745-3984.1969.tb00654.x 97. Bentler PM. Comparative fit indexes in structural models. Psychol Bull. 72. Rasiah S-MS, Isaiah R. Relationship between item difficulty and (1990) 107:238–46. doi: 10.1037/0033-2909.107.2.238 discrimination indices in true/false-type multiple choice questions of 98. Hu L, Bentler PM. Cutoff criteria for fit indexes in covariance structure a para-clinical multidisciplinary paper. Ann Acad Med Singap. (2006) analysis: Conventional criteria versus new alternatives. Struct Equ Model 35:67–71. Available online at: http://repository.um.edu.my/id/eprint/65455 Multidiscip J. (1999) 6:1–55. doi: 10.1080/10705519909540118 73. Demars C. Item Respons Theory. New York, NY: Oxford University Press 99. Jöreskog KG, Sörbom D. LISREL 8.54. Structural Equation Modeling With (2010). the Simplis Command Language (2004) Available online at: http://www.unc. 74. Lord FM. Applications of Item Response Theory to Practical Testing Problems. edu/~rcm/psy236/holzcfa.lisrel.pdf New Jersey, NJ: Englewood Cliffs (1980). 100. Browne MW, Cudeck R. Alternative ways of assessing model fit. In: Bollen 75. Bazaldua DAL, Lee Y-S, Keller B, Fellers L. Assessing the KA, Long, JS, editors. Testing Structural Equation Models. Newbury Park, performance of classical test theory item discrimination estimators CA: Sage Publications (1993). p. 136–62. in Monte Carlo simulations. Asia Pac Educ Rev. (2017) 18:585–98. 101. Yu C. Evaluating Cutoff Criteria of Model Fit Indices for Latent Variable doi: 10.1007/s12564-017-9507-4 Models With Binary and Continuous Outcomes. Los Angeles, CA: University 76. Piedmont RL. Inter-item correlations. In Encyclopedia of Quality of of California, Los Angeles. (2002). Life and Well-Being Research. Dordrecht: Springer (2014). p. 3303–4. 102. Gerbing DW, Hamilton JG. Viability of exploratory factor analysis as a doi: 10.1007/978-94-007-0753-5_1493 precursor to confirmatory factor analysis. Struct Equ Model Multidiscip J. 77. Tarrant M, Ware J, Mohammed AM. An assessment of functioning and non- (1996) 3:62–72. doi: 10.1080/10705519609540030 functioning distractors in multiple-choice questions: a descriptive analysis. 103. Reise SP, Morizot J, Hays RD. The role of the bifactor model in resolving BMC Med Educ. (2009) 9:40. doi: 10.1186/1472-6920-9-40 dimensionality issues in health outcomes measures. Qual Life Res. (2007) 78. 
Fulcher G, Davidson F. The Routledge Handbook of Language Testing. New 16:19–31. doi: 10.1007/s11136-007-9183-7 York, NY: Routledge (2012). 104. Gibbons RD, Hedeker DR. Full-information item bi-factor analysis. 79. Cizek GJ, O’Day DM. Further investigation of nonfunctioning options in Psychometrika (1992) 57:423–36. doi: 10.1007/BF02295430 multiple-choice test items. Educ Psychol Meas. (1994) 54:861–72. 105. Reise SP, Moore TM, Haviland MG. Bifactor models and rotations: exploring 80. Haladyna TM, Downing SM. Validity of a taxonomy of multiple-choice the extent to which multidimensional data yield univocal scale scores. J Pers item-writing rules. Appl Meas Educ. (1989) 2:51–78. doi: 10.1207/s153248 Assess. (2010) 92:544–59. doi: 10.1080/00223891.2010.496477 18ame0201_4 106. Brunner M, Nagy G, Wilhelm O. A Tutorial on hierarchically structured 81. Tappen RM. Advanced Nursing Research. Sudbury, MA: Jones & Bartlett constructs. J Pers. (2012) 80:796–846. doi: 10.1111/j.1467-6494.2011.00749.x Publishers (2011). 107. Vandenberg RJ, Lance CE. A review and synthesis of the measurement 82. Enders CK, Bandalos DL. The relative performance of full invariance literature: suggestions, practices, and recommendations information maximum likelihood estimation for missing data in for organizational research - Robert J. Vandenberg, Charles E. Lance, structural equation models. Struct Equ Model. (2009) 8:430–57. 2000. Organ Res Methods (2000) 3:4–70. doi: 10.1177/10944281 doi: 10.1207/S15328007SEM0803_5 0031002 Frontiers in Public Health | www.frontiersin.org 17 June 2018 | Volume 6 | Article 149 Boateng et al. Scale Development and Validation 108. Sideridis GD, Tsaousis I, Al-harbi KA. Multi-population invariance with of dietary assessment methods. Eur J Epidemiol. (1991) 7:339–43. dichotomous measures: combining multi-group and MIMIC methodologies doi: 10.1007/BF00144997 in evaluating the general aptitude test in the arabic language - Georgios D. 129. McPhail SM. Alternative Validation Strategies: Developing New and Sideridis, Ioannis Tsaousis, Khaleel A. Al-harbi, 2015. J Psychoeduc Assess. Leveraging Existing Validity Evidence. San Francisco, CA: John Wiley & Sons, 33:568–84. doi: 10.1177/0734282914567871 Inc (2007). 109. Joreskog K. A general method for estimating a linear equation system. In: 130. Dray S, Dunsch F, Holmlund M. Electronic Versus Paper-Based Data Goldberger AS, Duncan OD, editors. Structural Equation Models in the Social Collection: Reviewing the Debate. The World Bank Development Impact Sciences. New York, NY: Seminar Press (1973). pp. 85–112. (2016). Available online at: https://blogs.worldbank.org/impactevaluations/ 110. Kim ES, Cao C, Wang Y, Nguyen DT. Measurement invariance testing with electronic-versus-paper-based-data-collection-reviewing-debate (Accessed many groups: a comparison of five approaches. Struct Equ Model Multidiscip November 10, 2017). J. (2017) 24:524–44. doi: 10.1080/10705511.2017.1304822 131. Ellen JM, Gurvey JE, Pasch L, Tschann J, Nanda JP, Catania J. A 111. Muthén B., Asparouhov T. BSEM Measurement Invariance Analysis. randomized comparison of A-CASI and phone interviews to assess (2017). Available online at: https://www.statmodel.com/examples/webnotes/ STD/HIV-related risk behaviors in teens. J Adolesc Health (2002) 31:26–30. webnote17.pdf doi: 10.1016/S1054-139X(01)00404-9 112. Asparouhov T, Muthén B. Multiple-group factor analysis alignment. Struct 132. Chesney MA, Neilands TB, Chambers DB, Taylor JM, Folkman S. Equ Model. 21:495–508. 
doi: 10.1080/10705511.2014.919210 A validity and reliability study of the coping self-efficacy scale. Br 113. Reise SP, Widaman KF, Pugh RH. Confirmatory factor analysis and item J Health Psychol. (2006) 11(Pt 3):421–37. doi: 10.1348/135910705 response theory: two approaches for exploring measurement invariance. X53155 Psychol Bull. (1993) 114:552–66. doi: 10.1037/0033-2909.114.3.552 133. Thurstone L. Multiple-Factor Analysis. Chicago, IL: University of Chicago 114. Pushpanathan ME, Loftus AM, Gasson N, Thomas MG, Timms CF, Press (1947). Olaithe M, et al. Beyond factor analysis: multidimensionality and the 134. Fan X. Item response theory and classical test theory: an empirical Parkinson’s disease sleep scale-revised. PLoS ONE (2018) 13:e0192394. comparison of their item/person statistics. Educ Psychol Meas. (1998) doi: 10.1371/journal.pone.0192394 58:357–81. doi: 10.1177/0013164498058003001 115. Armor DJ. Theta reliability and factor scaling. Sociol Methodol. (1973) 135. Glockner-Rist A, Hoijtink H. The best of both worlds: factor analysis 5:17–50. doi: 10.2307/270831 of dichotomous data using item response theory and structural 116. Porta M. A Dictionary of Epidemiology. New York, NY: Oxford University equation modeling. Struct Equ Model Multidiscip J. (2003) 10:544–65. Press (2008). doi: 10.1207/S15328007SEM1004_4 117. Cronbach LJ. Coefficient alpha and the internal structure of tests. 136. Keeves JP, Alagumalai S, editors. Applied Rasch Measurement: A Book of Psychometrika (1951) 16:297–334. doi: 10.1007/BF02310555 Exemplars: Papers in Honour of John P. Keeves. Dordrecht ; Norwell, MA: 118. Zumbo B, Gadermann A, Zeisser C. Ordinal versions of coefficients alpha Springer (2005). and theta for likert rating scales. J Mod Appl Stat Methods (2007) 6:21–9. 137. Cappelleri JC, Lundy JJ, Hays RD. Overview of classical test theory and item doi: 10.22237/jmasm/1177992180 response theory for quantitative assessment of items in developing 119. Gadermann AM, GuhnM, Zumbo B. Estimating ordinal reliability for Likert patient-reported outcome measures. Clin Ther. (2014) 36:648–62. type and ordinal item response data: a conceptual, empirical, and practical doi: 10.1016/j.clinthera.2014.04.006 guide. Pract Assess Res Eval. (2012) 17:1–13. Available online at: http://www. 138. Harvey RJ, Hammer AL. Item response theory. Couns Psychol. (1999) pareonline.net/getvn.asp?v=17&n=3 27:353–83. doi: 10.1177/0011000099273004 120. McDonald RP. Test Theory: A Unified Treatment. New Jersey, NJ : Lawrence 139. Cook KF, Kallen MA, Amtmann D. Having a fit: impact of number Erlbaum Associates, Inc (1999). of items and distribution of data on traditional criteria for assessing 121. Revelle W. Hierarchical cluster analysis and the internal structure of tests. IRT’s unidimensionality assumption. Qual. Life Res. (2009) 18:447–60. Multivar Behav Res. (1979) 14:57–74. doi: 10.1207/s15327906mbr1401_4 doi: 10.1007/s11136-009-9464-4 122. Revelle W, Zinbarg RE. Coefficients alpha, beta, omega, and the glb: 140. Greca AML, Stone WL. Social anxiety scale for children-revised: factor comments on Sijtsma. Psychometrika (2009) 74:145. doi: 10.1007/s11336- structure and concurrent validity. J Clin Child Psychol. (1993) 22:17–27. 008-9102-z doi: 10.1207/s15374424jccp2201_2 123. Bernstein I, Nunnally JC. Pyschometric Theory. New York, NY: McGraw-Hill 141. Frongillo EA, Nanama S, Wolfe WS. Technical Guide to Developing a (1994). Direct, Experience-Based Measurement Tool for Household Food Insecurity. 124. Weir JP. 
JP: Quantifying test-retest reliability using the intraclass Washington, DC: Food and Nutrition Technical Assistance Project correlation coefficient and the SEM. J Strength Con Res. (2005) 19:231–40. (2004). doi: 10.1519/15184.1 125. Rousson V, Gasser T, Seifert B. Assessing intrarater, interrater and test– Conflict of Interest Statement: The authors declare that the research was retest reliability of continuous measurements. Stat Med. (2002) 21:3431–46. conducted in the absence of any commercial or financial relationships that could doi: 10.1002/sim.1253 be construed as a potential conflict of interest. 126. Churchill GA. A paradigm for developing better measures of marketing constructs. J Mark Res. (1979) 16:64–73. doi: 10.2307/3150876 Copyright © 2018 Boateng, Neilands, Frongillo, Melgar-Quiñonez and Young. This 127. Bland JM, Altman DG. A note on the use of the intraclass is an open-access article distributed under the terms of the Creative Commons correlation coefficient in the evaluation of agreement between two Attribution License (CC BY). The use, distribution or reproduction in other forums methods of measurement. Comput Biol Med. (1990) 20:337–40. is permitted, provided the original author(s) and the copyright owner are credited doi: 10.1016/0010-4825(90)90013-F and that the original publication in this journal is cited, in accordance with accepted 128. Hebert JR, Miller DR. The inappropriateness of conventional use academic practice. No use, distribution or reproduction is permitted which does not of the correlation coefficient in assessing validity and reliability comply with these terms. Frontiers in Public Health | www.frontiersin.org 18 June 2018 | Volume 6 | Article 149
