How to measure a latent construct: psychometric principles for the development and validation of measurement instruments

Carter, Stephen

Abstract

Measurement instruments are used to collect data about respondents. In social pharmacy, measurement instruments are often used to measure latent constructs, such as attitudes, among healthcare professionals and patients. This paper describes the fundamental aspects of designing and validating instruments that aim to measure latent constructs. The main focus is on the considerations and processes relating to exploratory and confirmatory factor analyses when they are used to develop measures of latent psychosocial constructs. The paper also presents a detailed summary of the current evidence and suggestions for item generation and sample selection, as well as an in-depth description of approaches to content and face validation. Suggestions for further reading are also provided.

Keywords: factor analysis, measurement instrument, psychometric testing, survey, validity

Introduction

Measurement instruments are used to collect a broad range of data about people's demographics, knowledge, attitudes and behaviours, among other constructs. Given the widespread accessibility and acceptability of the Internet, it is now easier than ever to disseminate scales and collect data, and advances in technology allow researchers to reach respondents more quickly and easily than ever before. Although many psychometrically tested measurement instruments are available in the literature, researchers often need to develop new measurement instruments to collect data about phenomena that have not been explored before. The purpose of this manuscript is to summarise the key considerations for researchers embarking on this task and to describe, in depth, the process of construct validation using exploratory and confirmatory factor analysis.

Understanding how to design psychometrically sound measurement instruments to measure psychological and social (psychosocial) phenomena is an important skill for many social pharmacy researchers. A psychometrically sound measurement instrument is one that measures what it intends to measure, consistently over time, in a specific setting and population. One example of a measurement instrument is a survey. A survey may be defined as an information-gathering tool used to quantitatively measure the attributes of a sample of people.[1] Surveys are commonly used in a broad range of settings, including social pharmacy, and are often the measurement tool of choice because they represent an inexpensive form of data collection that is relatively easy to disseminate. However, a survey may first need to be constructed, if one does not currently exist for the intended purpose, or adapted, if one exists but was developed for a different setting or population. The fact that researchers are able to write a set of questions does not mean that those questions consistently measure what they are intended to measure.

Conceptualisation of the construct of interest

Measurement instruments are developed to measure constructs; hence, one might start with the question: what is to be measured? As mentioned above, measurement instruments are able to measure psychosocial phenomena, but that is a broad field indeed. It is important to clearly define the goals and objectives of a measurement instrument. For psychometric testing, the goal should be to test one or more psychosocial constructs in a defined context.
Firstly, constructs need to be clearly articulated and differentiated and, secondly, the population in which the measurement will be undertaken needs to be defined.[2] These two criteria may be formulated as a research question. A research question (in this field of research, at least) should be a measurable question that clearly articulates the construct(s) and the target population of interest. A construct can be defined by drawing on a thorough review of the literature, preferably with reference to relationships with other psychosocial constructs.[3] For a detailed procedure of how to define a construct, we refer to Gilliam and Voss.[4] This article sets out a stepwise approach for the development and psychometric testing of measurement instruments to measure latent psychosocial constructs.

Latent constructs

A construct is considered latent when it is hidden in the mind of the respondent. A non-latent construct can be directly observed and measured through a combination of observations. An example of a non-latent construct is absolute cardiovascular disease risk, which mathematically combines gender, age, blood pressure, smoking status, cholesterol, and the presence of diabetes and left ventricular heart failure to calculate the risk of experiencing a cardiovascular adverse event within the next five years.[5] Latent constructs include attitudes, satisfaction, motivation, self-efficacy and acceptability. In the social pharmacy literature, a recent study exploring the latent construct of 'willingness (to use a health service)' reported that consumers were more willing to use a medication management service (a latent construct) when they were worried (latent) about their medicines and expected the pharmacist to reassure them (latent).[6,7] Measuring latent constructs is challenging because they are commonly abstract in nature. In the example above, the researcher cannot peer directly into the minds of the respondents to determine what they really think about the medication management service.

Reflective vs formative measurement

Typically, a latent construct is not measured directly, only indirectly, by interpreting the effects it has on observed variables.[8] In this context, observed variables are typically responses to questions in a scale. Generally, a latent construct comprises items (indicators) that are reflective of the construct, rendering it a reflective construct.[9,10] In a graphical sense, the direction of causality moves from the construct to the item.[11] Characteristics of scales that measure reflective constructs include the following:[9,10] items do not cause variation in the construct; the items can be interchangeable; and adding or dropping items does not change the nature of the construct. In formative measurement, the construct arises through combining or summing the items. The direction of causality is in the opposite direction, moving from the item to the construct. The construct is dependent on the items and is described by the items.[9,10] Characteristics of scales that measure formative constructs include the following:[9,10] items do not need to share a common theme; items are not interchangeable; and adding or dropping items fundamentally changes the nature of the construct. To some extent, this is a slightly simplified explanation because, as Brown[12] elucidates, some latent constructs can be measured by a combination of reflective and formative measurement models.
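The difference in the direction of causality can be summarised algebraically. The sketch below uses generic notation (lambda for loadings, w for weights, delta and zeta for residuals) as a schematic illustration; it is not notation taken from the cited authors.

```latex
% Reflective measurement: the latent construct \xi causes each observed item x_i
x_i = \lambda_i \xi + \delta_i , \quad i = 1, \dots, k

% Formative measurement: the construct \eta is formed by (is a function of) the items
\eta = w_1 x_1 + w_2 x_2 + \dots + w_k x_k + \zeta
```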
For further information regarding latent constructs, we refer to DeVellis.[11]

Development

A measurement instrument contains a list of one or more questions or statements. These questions or statements are commonly termed 'items' in psychometric applications. Item construction is a critical element of psychometric testing. While statements and questions are both used as items, both require an accompanying response format (scale), as the intention is to quantify the response. Therefore, open-ended questions are seldom seen in psychometric scales.[13] There has been a great deal of published research exploring how people respond to items in a survey. Krosnick[14] provides a summary of how this would ideally occur, from a research perspective, in that respondents would (modified from Krosnick):[14]

1. understand the intent of the item;
2. recall prior knowledge from memory;
3. formulate a summary judgement; and
4. translate their judgement to the provided response options.

Item generation

Item generation requires generation of the statement or question (sometimes referred to as the item stem), with an accompanying response format. There are several processes involved in constructing a suitable item for testing. It is important to ground each item in prior knowledge, as supported by a rigorous literature review. Several sources may be utilised, including prior surveys, qualitative data and theory, to ensure that, collectively, the items measure what we intend them to measure. Items should typically be formulated with clear, unambiguous language. Items are most commonly positively worded (positive valence) for optimal comprehension by the respondent. If there is concern that respondents may not be adequately considering each item response, for example due to survey fatigue or respondent acquiescence, consideration could be given to adding items with opposite polarity to the scale. The risk is that items with opposing valencies perform poorly in analysis, and it is recommended that they be avoided if possible.[11] Item wording should also be neutral, that is, not leading the respondent to answer in a particular manner. Items should also refer to a single construct or issue to ensure clarity.[15] We refer readers to DeVellis for further detail regarding item generation.[11]

Next, it is important to generate a sufficient number of items to create an item pool. When deciding on the size of the item pool, researchers might weigh the following competing principles:

- there should be sufficient items to cover the breadth of the construct; and
- the concept of parsimony, that is, it is best to use the minimum number of items in order to reduce survey fatigue.

Furthermore, it is generally accepted that the statistical methods (e.g. factor analysis) used to validate the scientific soundness of a construct will require multiple items. Typically, three or more items for each dimension provide useful statistical information about shared variance.[16] While it is possible to conduct factor analyses with fewer than three items for each dimension, the issue of model 'identification' requires extra consideration.[17] Until the factor analyses are conducted (see below), it is not known how many items will eventually 'fit', and so more items should be generated at this early stage. After generating the item stem, the second step is to consider the response format.
Potential considerations include the following (modified from Furr):[13]

- Type of response: for example, Likert-type (agree/disagree), frequency (always, sometimes, never) or semantic differential (paired adjectives, e.g. happy/sad).
- Number of response options: for Likert-type scales, 5 or 7 points are commonly used,[18] acknowledging that increasing the number of options can reduce 'bunching' of responses at the tail end of distributions (skewness) and reduce the sharpness of the peak around the mean (kurtosis). However, a higher number of options may make it difficult for respondents to discriminate between response options, resulting in decreased measurement quality.[19]
- Labels/anchors: should responses be numerically labelled and, if so, which ones?
- Mid-points: an even number of response options without a mid-point forces respondents to choose a side. This may avoid neutral or mid-point responses, which respondents may choose due to item difficulty.
- 'No opinion' or 'I don't know': similar to providing a mid-point, an 'I don't know' option may invite respondents to satisfice and exert less cognitive effort.[20] It may also lead to missing data.

A useful guide for item construction, covering both the question/statement and the response, is presented in Schaeffer and Dykema.[21] Finally, when assembling items to constitute the instrument, it is important to consider the order of presentation. The order and visual presentation of items can affect responses. It is also important to ensure consistency in response format across all items within a survey, as this will affect summed data.[13]

Pretesting

Despite paying careful attention when developing a measurement instrument to ensure it is fit for purpose, it may still require alteration prior to its use in data collection. Pretesting of the instrument is essential to avoid costly or even catastrophic error when collecting data. We refer to Presser and colleagues,[22] who provide an overview of techniques used to pretest instruments, such as interviews with pilot respondents, cognitive interviews, measurement of response latency and statistical techniques such as latent class analysis.[22]

Face and content validity

Validity refers to the ability of an instrument to measure what it intends to measure.[23] When developing a measurement instrument, the first types of validity that are often explored are content and face validity of the items, as they are determined at the item generation and selection stage. Content validity is the degree to which the items are relevant to and representative of the defined construct[2] and is an integral component when developing measurement instruments. There are several quantifiable techniques to test for content validity, although they are predominately measures of agreement between raters. The raters are typically identified as content experts. We refer readers to a recent paper by Almanasreh and colleagues,[24] which provides an overview of methods to explore content validity.
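To make the idea of rater agreement concrete, the sketch below computes an item-level content validity index (I-CVI), one of the agreement-based indices covered in reviews of content validation such as Almanasreh and colleagues.[24] The ratings, the 4-point relevance scale and the 0.78 acceptance threshold are illustrative assumptions, not prescriptions from this paper.

```python
import numpy as np

# Hypothetical ratings: 6 content experts rate 4 items on a 4-point relevance
# scale (1 = not relevant ... 4 = highly relevant). Rows = items, columns = experts.
ratings = np.array([
    [4, 3, 4, 4, 3, 4],
    [2, 3, 2, 1, 2, 3],
    [4, 4, 3, 4, 4, 4],
    [3, 2, 4, 3, 3, 2],
])

# I-CVI: proportion of experts scoring the item 3 or 4 (i.e. judging it relevant).
relevant = ratings >= 3
i_cvi = relevant.mean(axis=1)

# S-CVI/Ave: scale-level index, the mean of the item-level indices.
s_cvi_ave = i_cvi.mean()

for item, cvi in enumerate(i_cvi, start=1):
    flag = "review/revise" if cvi < 0.78 else "retain"   # 0.78 is a commonly cited cut-off
    print(f"Item {item}: I-CVI = {cvi:.2f} ({flag})")
print(f"S-CVI/Ave = {s_cvi_ave:.2f}")
```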
Face validity, on the other hand, is a judgement of whether the items truly measure what they intend to measure, with the rater being a person for whom the scale is intended rather than a content expert.[25] Although some researchers dismiss the importance of face validity relative to content validity,[24] it remains an important consideration when constructing a valid measurement instrument. Face validity is, logically, crucial to ascertain when researchers have less information about how potential respondents would respond to items, keeping in mind that an optimal response would meet the four criteria provided by Krosnick (see earlier).[14]

Reliability

Reliability refers to whether the measurement instrument produces consistent measurements over repeated administrations,[23] and it is specific to the population among whom it is estimated.[26] Various forms of reliability are relevant to the psychometric testing of measurement instruments. Test-retest reliability, for example, measures the instrument's ability to produce consistent results over time.[23,27] Another form of reliability is internal consistency, which estimates how closely a set of items or questions are related. Commonly, Cronbach's alpha (coefficient alpha) is used to measure internal consistency;[27] however, the mathematical basis for Cronbach's alpha assumes tau equivalence, which, in real-world settings, is unlikely to be met.[28-30] We refer to Mirzaei et al.[31] for an example of how coefficient omega and coefficient alpha are used to report scale reliability of a perceived service quality measure in the social pharmacy literature. A recent paper in the educational literature highlights the importance of understanding the factorial structure before accepting reports of high internal consistency as claims of scale reliability.[16] An important issue is that a high Cronbach's alpha can be calculated for multidimensional scales (when one would not expect high internal consistency), particularly when scales include a high number of items. The concept of identifying, measuring and scoring multidimensional scales is dealt with in subsequent sections of this manuscript.
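The two internal consistency coefficients mentioned above can be computed directly. The sketch below is a minimal illustration assuming a hypothetical matrix of item responses and, for omega, standardised loadings from a one-factor model; the numbers are invented and are not taken from the papers cited.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha from an n_respondents x n_items matrix of scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def mcdonald_omega(loadings: np.ndarray) -> float:
    """Coefficient omega from standardised loadings of a single-factor model."""
    sum_loadings = loadings.sum()
    error_var = (1 - loadings ** 2).sum()   # unique variance of each item
    return sum_loadings ** 2 / (sum_loadings ** 2 + error_var)

# Hypothetical data: five Likert items answered by 200 respondents.
rng = np.random.default_rng(0)
true_score = rng.normal(size=(200, 1))
responses = np.clip(np.round(3 + true_score + rng.normal(scale=0.8, size=(200, 5))), 1, 5)

print(f"alpha = {cronbach_alpha(responses):.2f}")
print(f"omega = {mcdonald_omega(np.array([0.75, 0.70, 0.68, 0.72, 0.65])):.2f}")
```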
Sample selection

A key property of a psychometrically sound instrument measuring a latent construct is that it is generalisable. This means that the concept(s) the scale is measuring are transferable between those persons who have been measured by the scale and those who have yet to be measured. Accordingly, it is critical that the sample is representative of the population of interest. Inadequate sample selection can lead to poor precision and low accuracy, thereby affecting the quality of the quantified estimate based on the sample responses.[32] Samples are drawn from the sampling frame, being the portion of the population from which we are able to sample. Ideally, the sampling frame is the entire population of interest (and no one else), but that is not always the case. For example, if a researcher wanted to ask culturally and linguistically diverse consumers how satisfied they are with the information provided by their pharmacist when having their medication dispensed, it may be difficult to sample from the entire target population (i.e. all culturally and linguistically diverse consumers who take medicine) due to barriers of accessibility or language. Sampling error may consequently occur. Once the sampling frame is considered, a sampling strategy needs to be designed.[33] Considerations include the sample size, which is informed by the precision desired in analysis (see below), and the method by which the sample is selected. Typically, a random sample is selected. However, stratification of the random sample can be employed to increase the likelihood that a representative sample is obtained. The variables used to stratify a sample should be selected on the basis of prior information indicating that they cause significant variation in the concept(s) of interest. For further reading about these concepts, we refer to Fowler.[34]

Mode of delivery

Once the sample has been identified, the researcher must decide how the measurement instrument will be delivered. Typical modes of delivery include face-to-face, telephone, mail/e-mail, web/Internet/online panel and phone application. Each mode of delivery has advantages and disadvantages. For example, electronic delivery will automatically restrict the sampling frame to potential participants with access to the relevant technology. Alternatively, instruments administered via telephone may be limited by the fact that many people no longer have landlines, which in turn restricts the sampling frame. Consideration should be paid to which modes of delivery are relevant for the population of interest, and multiple modes of delivery may be the optimal way of obtaining a representative sample. For further reading about these concepts, we refer to Fowler[34] and Groves et al.[1]

Data analysis to determine construct validity

Once a scale has been developed to measure a construct, using the methods for item generation and selection described previously, it is then necessary to test whether the resulting items do indeed measure the construct, and whether items need to be removed. This process is known as data reduction. Construct validity is often explored using data reduction techniques, which can be exploratory or confirmatory in nature.[8] A construct may consist of one or more dimensions. For example, when exploring consumer perceptions of service quality in community pharmacy, early research indicated that this construct was composed of two dimensions, namely technical and interpersonal quality.[35] However, more recent research in this area has identified six dimensions of perceived service quality.[31,36] When exploring latent psychological constructs, such as perceived service quality, factor analysis (FA) can illustrate the underlying factor structure of a set of interrelated items, which group together to form one or more dimensions, or in this case, factors.[37]

Factor analysis

Factor analysis is an overarching technique that refers to two different types of statistical analyses, namely exploratory (EFA) and confirmatory factor analysis (CFA).[9,38] Exploratory methods are used to determine the underlying structure of a scale and are useful for 'exploring' the unknown characteristics of a measurement instrument. If previous theoretical support is available, or if the structure of a scale is known a priori, then the researcher may not need to utilise exploratory methods and may progress to checking the 'fit' of an existing model using CFA.[9,38,39] In either exploratory or confirmatory methods, the type of construct influences the data reduction technique chosen, the analysis of the output and the interpretation of the model.

Data reduction techniques

In order to select the most appropriate data reduction method, it is important to consider the nature of the construct. Recall that a latent construct is a construct that exists in the mind of the respondent.[40] The most commonly used approaches for data reduction are EFA and principal components analysis (PCA).
PCA is often the default data reduction technique in many statistical programs, resulting in its widespread use.[9,38] There is often confusion in the literature regarding the difference between EFA and PCA and whether these two techniques should be used for the same purpose. The potential confusion may arise because both techniques lead to data reduction. However, it is important to recognise that EFA and PCA use different underlying mathematical techniques.[9] As the name suggests, PCA is a technique that leads to the generation of components, rather than factors. PCA is a suitable technique when the aim of the research is to reduce the number of variables, and it is typically used for data reduction of formative constructs.[41] PCA was not designed to consider the structure of the correlations among variables, but rather to form a smaller set of measured variables. It focuses on the variances of the measured variables rather than the correlations that exist among them. The principal component model allows the researcher to explain the maximum amount of variance by creating 'components' and produces results that are unique to that data set. The components generated from PCA simply represent efficient methods of capturing information in the measured variables, regardless of whether those measured variables represent meaningful latent constructs.[9,10,39] PCA yields constructs that are formative; as such, the model is specific to the data set and the results are not generalisable to the wider population. PCA may also be used as a diagnostic test if Heywood cases (near-zero or negative error variances) are evident in EFA.[42] CFA with formative indicators requires special attention to model specification and identification.[43] Structural equation modelling (SEM), but not CFA, can be used for formative constructs, with the technique of partial least squares (PLS). Since formative constructs are outside the scope of this article, readers are referred to Hair.[44]

For latent constructs, consider using EFA

Exploratory factor analysis was originally developed as a general mathematical framework for understanding correlations among measured variables.[45] In EFA, the pattern of correlations is considered to be influenced by latent constructs, or the common factor.[9] The correlations between the variables, and the patterns among them, produce results that are likely to be generalisable to the wider population.[45] EFA techniques, such as maximum likelihood (ML) and principal axis factoring (PAF), are suitable for exploring underlying factors and producing meaningful scores.[41] ML methods generally require a normal distribution; however, the number of factors extracted and the underlying factor loadings are not always severely affected by skewed data.[46] In cases where normality is severely violated, PAF is recommended.

Exploratory factor analysis

The suitability of the data for EFA (factorability)

Prior to conducting EFA, it is important to explore the data set and determine whether this type of analysis is suitable. When conducting EFA, a general rule of thumb is that the larger the sample, the better. However, there is a lack of consensus as to what constitutes a minimum sample size. Furthermore, the required sample size also depends on the number of items; hence a common way to determine the minimum required sample size is the subject-to-item ratio.[47,48] Despite a suggested minimum ratio of 5:1, studies are still published with much lower ratios than this.[45,47,48] There are also a number of 'checks' that can be conducted to ensure that EFA is a suitable technique.[49] If the determinant of the correlation matrix is greater than 0.00001, this is a good initial indicator that the data are not multicollinear.[48,49] Another output is the Kaiser-Meyer-Olkin measure of sampling adequacy (KMO), which ranges between 0 and 1, with higher values indicating that the data are more suitable for EFA.[49-51] Finally, a significant (P < 0.05) Bartlett's test of sphericity also indicates the factorability of the data.[49,52]
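The three factorability checks just described can be computed directly from the item correlation matrix. The sketch below is a from-scratch illustration using numpy and scipy on hypothetical data; dedicated statistical packages report the same quantities, and the formulas shown are the standard ones rather than anything specific to this paper.

```python
import numpy as np
from scipy import stats

def factorability_checks(items: np.ndarray):
    """Determinant, KMO and Bartlett's test for an n_respondents x n_items matrix."""
    n, p = items.shape
    r = np.corrcoef(items, rowvar=False)           # item correlation matrix

    determinant = np.linalg.det(r)                 # want > 0.00001 (not multicollinear)

    # KMO: compares observed correlations with anti-image (partial) correlations.
    inv_r = np.linalg.inv(r)
    partial = -inv_r / np.sqrt(np.outer(np.diag(inv_r), np.diag(inv_r)))
    off = ~np.eye(p, dtype=bool)
    kmo = (r[off] ** 2).sum() / ((r[off] ** 2).sum() + (partial[off] ** 2).sum())

    # Bartlett's test of sphericity: does r differ from an identity matrix?
    chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(determinant)
    df = p * (p - 1) / 2
    p_value = stats.chi2.sf(chi2, df)
    return determinant, kmo, chi2, p_value

rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 1))
data = latent + rng.normal(scale=1.0, size=(300, 6))   # six correlated hypothetical items

det, kmo, chi2, p = factorability_checks(data)
print(f"determinant = {det:.5f}, KMO = {kmo:.2f}, Bartlett chi2 = {chi2:.1f} (P = {p:.3g})")
```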
Rotation methods

When conducting data reduction, one decision to be made is the method of rotation. While it is possible to conduct EFA without rotation, interpretation of the derived factors is more straightforward with rotation.[53] There are two main categories of rotation. The first is orthogonal rotation, which is the method of choice when the factors are thought to be uncorrelated; varieties include varimax, quartimax and equamax. The second is oblique rotation, which is more suitable for factors that are expected to be correlated; varieties include direct oblimin, quartimin and promax. These two categories provide factors that are either uncorrelated (orthogonal methods) or correlated (oblique methods). Thus, a researcher should consider the relationship between factors a priori. If factors are uncorrelated, then both methods should provide similar results; however, it is difficult to assume that no correlation exists in real-world data when exploring latent constructs.

Factor retention

Once the initial analysis is generated, the researcher needs to decide how many factors to retain. There are various methods for determining the number of factors to retain, and using a combination of objective and subjective methods is advised. Eigenvalues reflect how much of the variance of the total sample is explained by a factor.[39] The most commonly applied rule is the K1 rule, which specifies retaining all factors that have an eigenvalue of 1 or greater.[54] The K1 rule is an arbitrary rule, although it is consistently used within the literature. The scree plot can also be used to determine how many factors to retain.[55] The scree plot is usually automatically generated in the software used to conduct FA. The researcher must then examine the scree plot and determine the point at which the line 'kinks' or bends; the number of factors prior to this point is usually retained. Another tool to aid in this decision is parallel analysis, which uses randomly generated data with the same number of observations and variables as the original data set and creates eigenvalues based on this random data set.[41,49] The researcher then compares the eigenvalues of the real data set with the eigenvalues generated from parallel analysis. If the eigenvalue generated from the true data set is larger than that generated from the random data set, then the factor is retained. Once the random eigenvalues exceed those generated from the true data set, the factors are no longer retained.
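A minimal sketch of this comparison is shown below, using numpy on hypothetical data. Here the random eigenvalues are averaged over a number of simulated data sets; some implementations instead use a percentile (e.g. the 95th), so treat these details as illustrative assumptions.

```python
import numpy as np

def parallel_analysis(data: np.ndarray, n_sims: int = 100, seed: int = 0) -> int:
    """Number of factors whose real eigenvalues exceed the mean random eigenvalues."""
    rng = np.random.default_rng(seed)
    n, p = data.shape

    real_eigs = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]

    random_eigs = np.zeros((n_sims, p))
    for s in range(n_sims):
        sim = rng.normal(size=(n, p))        # uncorrelated random data of the same shape
        random_eigs[s] = np.sort(np.linalg.eigvalsh(np.corrcoef(sim, rowvar=False)))[::-1]
    mean_random = random_eigs.mean(axis=0)

    # Retain factors only while the real eigenvalue exceeds the random benchmark.
    retained = 0
    for real, rand in zip(real_eigs, mean_random):
        if real > rand:
            retained += 1
        else:
            break
    return retained

# Hypothetical example: two latent factors driving eight items.
rng = np.random.default_rng(42)
f = rng.normal(size=(400, 2))
items = np.hstack([f[:, [0]] + rng.normal(scale=0.8, size=(400, 4)),
                   f[:, [1]] + rng.normal(scale=0.8, size=(400, 4))])
print("Factors to retain:", parallel_analysis(items))
```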
Given the tendency of both the K1 rule and the scree plot to over-dimensionalise,[56] it is recommended that parallel analysis be used for all EFA studies. Programs for conducting parallel analysis are available for SPSS, SAS and R,[57] and it is available as an option in Mplus; readers are referred to the program documentation. Ultimately, researchers may need to use a combination of these techniques, along with their own conceptual understanding of the topic, to decide on the number of factors to retain.

Item deletion

Every item in a scale will load onto every factor when analysed using EFA; however, the strengths of these loadings will differ. Items can load onto factors strongly or weakly, and there is no consensus on the optimal cut-off for a 'strong' loading; nonetheless, a value between 0.4 and 0.45 seems to be widely accepted in the literature.[39,41] Furthermore, given that items load onto multiple factors, there will be differences between the numerical values of each loading. In general, the larger the difference between the loadings of the same item on different factors, the better, because it allows the researcher to decide, with greater certainty, which factor the item belongs to. Items that have poor loadings on all factors, or high loadings with little difference on multiple factors, may require deletion, as these numerical outputs are often indicators of poor discrimination between factors and, ultimately, a lack of construct validity.[41,58] One must consider not only which items to delete but also how many items to delete. It was mentioned above that latent constructs should ideally have a minimum of three items. When deciding whether to delete one of only three items, the question becomes: what is the minimum number of items needed for a factor? When there are only two items within a factor, the only information available about the internal validity of the construct is the correlation between those two items; including more than two items provides significantly more information regarding shared variance. For more detailed information on how to conduct EFA, we refer to Costello and Osborne,[45] Hair,[39] Preacher and MacCallum,[59] Tabachnick and Fidell,[41] Fabrigar and Wegener[9] and Kaplan.[38]
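The loading-based deletion heuristics described above are easy to automate once a rotated loading matrix is available from any EFA program. The sketch below applies a 0.4 cut-off for the strongest loading and a 0.2 minimum gap between the two strongest loadings; both thresholds and the loading matrix are illustrative assumptions rather than values mandated by the sources cited.

```python
import numpy as np

def flag_items(loadings: np.ndarray, min_loading: float = 0.4, min_gap: float = 0.2):
    """Flag items whose rotated loadings suggest poor discrimination between factors."""
    flags = []
    for i, row in enumerate(np.abs(loadings)):
        ranked = np.sort(row)[::-1]                  # strongest loading first
        if ranked[0] < min_loading:
            flags.append((i, "weak loading on all factors"))
        elif len(ranked) > 1 and ranked[0] - ranked[1] < min_gap:
            flags.append((i, "cross-loads with little difference"))
    return flags

# Hypothetical rotated loading matrix: six items on two factors.
loadings = np.array([
    [0.72, 0.10],
    [0.68, 0.05],
    [0.15, 0.75],
    [0.08, 0.66],
    [0.30, 0.22],   # weak everywhere
    [0.51, 0.48],   # cross-loading
])

for item, reason in flag_items(loadings):
    print(f"Item {item + 1}: consider deletion ({reason})")
```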
Confirmatory factor analysis

Once a researcher has created a 'model' of the latent construct using data reduction techniques, the next step is to test the model using a separate data set. This test may be performed using CFA, which is a form of SEM.[60,61] In CFA, the researcher hypothesises that the items chosen to represent the latent construct (the model) actually do so. This is done by writing a series of simple regression equations including error terms. Once the model is created, a statistical test is performed to determine how well the model 'fits' the data. The object of the exercise is to find a model that has a small and non-significant chi-square. That said, for most pharmacy practice research a non-significant chi-square is unlikely, and much attention is therefore given to absolute and comparative 'fit' statistics (see below). Similar to EFA, one of the first considerations for most researchers is whether there will be a sufficient number of responses, that is, whether the data set is large enough. There remains considerable debate about this topic; however, many researchers use a 10:1 or 20:1 subject-to-item ratio for CFA. Therefore, for a 20-item scale, 200-400 responses would be required. However, many factors influence this, such as the normality of the data, the size of the loadings, the size of the correlations between factors and model complexity.[62] Recent guidance on sample size for CFA suggests that a minimum of 100 cases is necessary for most applications and that researchers should also consider the sampling frame and generalisability (as discussed earlier).[63]

Researchers with little experience in CFA will find programs with graphical interfaces easiest to use, such as AMOS (http://www.amosdevelopment.com) and EQS (http://www.mvsoft.com/products.htm). The graphical interface helps the researcher 'draw' the model rather than having to write the mathematical equations. The equations themselves are not especially complicated, but the syntax needs to be absolutely correct for the analysis to work. For researchers with expertise in writing syntax, LISREL (http://www.ssicentral.com), Mplus (http://www.statmodel.com) and R (http://www.r-project.org) are recommended applications. R is useful if there are budgetary limitations, because it is open-source software. However, the choice of program will also depend on the data. The researcher needs to decide whether the data are reasonably normally distributed, because testing of hypotheses generally depends on using ML methods. It is important to note that the data available from most survey research will not be normally distributed. Therefore, how 'non-normal' the data are will influence the selection of the software. Programs that offer 'robust' statistical methods, which can overcome the issues related to non-normality, should be used for most survey data. Robust methods provide a mathematical correction to reduce the inflated chi-square that is characteristic of non-normal data; see Finney and DiStefano[64] or Brown[12] (Chapter 9). Here, AMOS is a little limiting because, in order to use its robust method, there can be no missing data points in any items for any cases. EQS, Mplus and R offer robust methods in the presence of missing data.

Missing data

Survey data are notorious for having missing responses. ML requires that any cases with missing responses are deleted either listwise or pairwise (which can dramatically reduce the number of responses); otherwise, imputation methods may be used.[65] However, choosing the most appropriate imputation method requires knowledge of the type of missingness. A modern approach to missingness is to use full information maximum likelihood (FIML) methods, which utilise every available response in the presence of some missingness. FIML methods generally produce unbiased parameter estimates, even with non-normal data, when the data are missing at random or missing completely at random.[66]

Model fit

Much has been written about model fit and, while the literature offers reasonably well-accepted guidelines, there really is no absolutely correct solution unless the chi-square is not significantly different from zero.[63,67,68] A non-significant chi-square is much more likely with small models and should be expected when there are fewer than six degrees of freedom.[63] Most models will have far more degrees of freedom, and so comparative fit indices assist with interpretation. We recommend, as a minimum, reporting the chi-square with degrees of freedom and significance, along with the following as targets: root mean square error of approximation (RMSEA) <0.06, comparative fit index (CFI) >0.95 and Tucker-Lewis index (TLI) >0.95.
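The comparative indices above are simple functions of the model and baseline (null model) chi-square values, so they can be reproduced from the output of any SEM program. The sketch below uses the standard formulas with made-up chi-square values purely for illustration.

```python
import numpy as np

def fit_indices(chi2: float, df: int, chi2_null: float, df_null: int, n: int):
    """RMSEA, CFI and TLI from model and baseline chi-square statistics (standard formulas)."""
    rmsea = np.sqrt(max(chi2 - df, 0) / (df * (n - 1)))
    cfi = 1 - max(chi2 - df, 0) / max(chi2_null - df_null, chi2 - df, 0)
    tli = ((chi2_null / df_null) - (chi2 / df)) / ((chi2_null / df_null) - 1)
    return rmsea, cfi, tli

# Hypothetical output for a 12-item, 3-factor model estimated on 350 responses.
rmsea, cfi, tli = fit_indices(chi2=96.4, df=51, chi2_null=2150.0, df_null=66, n=350)
print(f"RMSEA = {rmsea:.3f}, CFI = {cfi:.3f}, TLI = {tli:.3f}")
```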
Recall that these are targets and are sometimes difficult to achieve, particularly in preliminary studies. Not meeting these targets is not necessarily a reason for authors or reviewers to withhold the results from publication; however, when the RMSEA is >0.08, or the CFI and/or TLI is <0.9, this is indicative of very poor fit and the results are unlikely to be generalisable. The Akaike information criterion (AIC) can be used to test the fit of alternative conceptual models, if required. It is important to report other statistical output from the hypothesis testing. As a minimum, we recommend reporting unstandardised regression weights (URW) with 95% confidence intervals and standardised regression weights (SRW), which should be above 0.7 for each item. In addition, we also recommend reporting statistics about the constructs themselves and the correlations between the constructs. Construct validity is demonstrated when the average variance extracted (AVE) is >50% and construct reliability (CR) is >0.7.[69] These statistics are akin to internal consistency tests such as Cronbach's alpha and omega. Note that most programs do not provide AVE or CR as output, but they can be calculated from the SRW.[70] These statistics can also be useful for determining whether two constructs are actually separate or whether they might just as well be modelled as a single construct.[69] This test of the separateness of the constructs refers to tests of discriminant validity. The separateness of the constructs also affects the measurement or scoring of the construct, and this is discussed below. For further information on this topic, we refer to Carter.[70]
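Because AVE and CR are simple functions of the standardised regression weights, they can be calculated in a few lines once the SRW have been extracted from the CFA output. The sketch below uses the Fornell and Larcker formulas[69] with hypothetical loadings for a single four-item construct.

```python
import numpy as np

def ave_and_cr(srw: np.ndarray):
    """Average variance extracted and construct reliability from standardised loadings."""
    error_var = 1 - srw ** 2                  # residual variance of each item
    ave = (srw ** 2).mean()                   # mean proportion of item variance captured
    cr = srw.sum() ** 2 / (srw.sum() ** 2 + error_var.sum())
    return ave, cr

# Hypothetical standardised regression weights for a four-item construct.
srw = np.array([0.82, 0.76, 0.71, 0.69])
ave, cr = ave_and_cr(srw)
print(f"AVE = {ave:.2f} (target > 0.50), CR = {cr:.2f} (target > 0.70)")
```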
Model trimming and alternative models

Sometimes, in preliminary studies, it may be appropriate to improve model fit by trimming or making other modifications. Model trimming involves removing items from the model that have poor loadings (SRW below 0.7). An SRW below 0.7 indicates that less than 50% of the variance in that item is explained by the construct. It is recommended that fewer than 10% of items be deleted; otherwise, the modelling is considered too opportunistic or exploratory. Another common way to modify the model is to introduce correlated error terms. If items have correlated error terms, this implies that there is an external influence (besides the construct) on the way respondents answered two of the questions. A positive correlated error can occur when two items are worded very similarly. A negative correlated error can occur between a positively and a negatively worded item probing the same concept. While a simple CFA does not provide information about which items might have correlated error, the statistical programs provide documentation about how to access modification indices. Here, it is useful to reflect on the guidance of Schreiber,[60] who suggests that CFA should be used to test alternative models based on competing conceptual understandings of the construct. Fit statistics should be used to compare models; when comparative models are substantively different, the AIC should be reported. Not all hypothesised models will produce an acceptable CFA result. Researchers may need to consider re-analysing the data using EFA to determine whether another conceptual model may represent the data better. It would then be appropriate to obtain data from a new source to repeat the CFA.

Measurement of the constructs

If a latent construct is to be measured, the researcher will need to make decisions about how to 'score' the construct. For a uni-dimensional construct, the simplest approach is to calculate a summated score; that is, for each participant, simply take the mean of the responses (remembering that the denominator may vary with missing responses). However, there are more sophisticated approaches. One potential output of EFA and CFA is the calculation of 'factor scores'. For each dimension, it is possible to calculate and retain a factor score, which weights items with the highest loadings. A summary of how to decide on the best approach is provided by DiStefano et al.[71]
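A brief sketch of the two scoring approaches just mentioned is shown below: a missing-aware summated (mean) score, and a crude loading-weighted score. Proper factor scores are estimated from the fitted model (e.g. by the regression method), so the weighted version here, like the loadings and responses, is only an illustrative assumption.

```python
import numpy as np

# Hypothetical responses of 4 participants to a five-item uni-dimensional scale (NaN = missing).
responses = np.array([
    [4, 5, 4, 3, 4],
    [2, np.nan, 3, 2, 2],
    [5, 4, np.nan, np.nan, 5],
    [3, 3, 4, 3, 3],
], dtype=float)

# Summated (mean) score: the denominator shrinks automatically when items are missing.
summated = np.nanmean(responses, axis=1)

# Crude weighted score: weight each item by its (hypothetical) standardised loading.
loadings = np.array([0.80, 0.75, 0.70, 0.60, 0.78])
weights = loadings / loadings.sum()
weighted = (np.nansum(responses * weights, axis=1)
            / np.nansum(~np.isnan(responses) * weights, axis=1))

print("Summated scores:", np.round(summated, 2))
print("Loading-weighted scores:", np.round(weighted, 2))
```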
If the construct of interest is multidimensional, then a decision has to be made as to whether to combine the sub-dimensions into an overall score or whether it is only valid to score each sub-dimension separately. If the tests of discriminant validity (mentioned above) show that there is more than one construct, the scores may need to be reported and treated separately. Summing the scores of separate sub-constructs together does not make a great deal of sense unless all the correlations between them are very high. Scoring is important for researchers exploring relationships between variables, but it becomes even more salient for scoring at the individual level, for example when a clinical psychologist uses a measurement instrument to design treatment based on estimates of a patient's thoughts, feelings and behaviour. A decision support strategy has been proposed which includes modelling a single general dimension along with factors for each sub-dimension. These so-called bifactor models provide additional statistical information to help inform scoring decisions.[72] An example illustrating how bifactor modelling can assist scoring decisions is available in a study investigating the factorial structure of the Hospital Anxiety and Depression Scale.[73] The authors concluded that high factor loadings on a general factor (named general distress) meant that it was more valid to score overall general distress, rather than anxiety and depression separately as independent measures. Nonetheless, it is important to remember that while statistics can take us only so far, the way separate sub-dimensions of constructs are used requires conceptual understanding and explanation.[71] Of course, one way to use single or multiple latent variables would be to conduct a fully latent structural regression model, or structural equation model (SEM). In this case, the measurement of the latent factors occurs concurrently with measurement of influence.[74]

Multi-group CFA

Confirmatory factor analysis can be used to test whether a construct is valid among separate populations or groups. Indeed, the ability to compare the way respondents answer scales across groups is one of the most powerful aspects of CFA. Measurement invariance is the technique which provides information about the degree to which the constructs are comparable across populations. Configural invariance, the most superficial level, is demonstrated when people from different groups employ the same conceptual framework to answer the test items. Knowing whether higher levels of invariance, including metric and scalar invariance, exist can assist with decisions as to whether the measurement instruments should be scaled differently between groups. This is a fairly technical procedure, and readers are referred to Wu et al. for practical guidance in this area.[75]

Conclusion

This manuscript is intended to provide social pharmacy researchers and reviewers with an introductory overview of the key concepts relating to the development and psychometric testing of instruments measuring latent constructs. Paying careful attention to the guidance provided here and within the reference list should facilitate the development of psychometrically sound measurement instruments. Psychometrics is an emerging area; we have attempted to summarise the key principles as at the time of publication. However, researchers working in this area should also stay up to date as new techniques emerge and new standards of practice are established.

Declarations

Conflict of interest: None of the authors have any conflicts of interest to declare.

Funding: This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.

Authors' contributions: Each author contributed significantly to the development of the manuscript. All authors state that they had complete access to the study data that support the publication.

References

1. Groves RM et al. Survey Methodology, 2nd edn. Hoboken, NJ: John Wiley & Sons, 2009.
2. Haynes SN, Richard D, Kubany ES. Content validity in psychological assessment: a functional approach to concepts and methods. Psychol Assess 1995; 7: 238.
3. Clark LA, Watson D. Constructing validity: basic issues in objective scale development. Psychol Assess 1995; 7: 309.
4. Gilliam DA, Voss K. A proposed procedure for construct definition in marketing. Eur J Market 2013; 47: 5–26.
5. National Vascular Disease Prevention Alliance (2012). Australian absolute cardiovascular risk calculator. http://www.cvdcheck.org.au/ (accessed 2 January 2020).
6. Carter SR et al. Consumers' willingness to use a medication management service: the effect of medication-related worry and the social influence of the general practitioner. Res Soc Admin Pharm 2013; 9: 431–445.
7. Carter SR et al. Patients' willingness to use a pharmacist-provided medication management service: the influence of outcome expectancies and communication efficacy. Res Soc Admin Pharm 2012; 8: 487–498.
8. Thompson B, Daniel LG. Factor analytic evidence for the construct validity of scores: a historical overview and some guidelines. Educ Psychol Measure 1996; 56: 197–208.
9. Fabrigar LR, Wegener DT. Exploratory Factor Analysis. New York, NY: Oxford University Press, 2012.
10. Suhr DD. Principal component analysis vs. exploratory factor analysis. SUGI 30 Proc 2005; 203: 230.
11. DeVellis RF. Scale Development: Theory and Applications, 4th edn. Chapel Hill: Sage Publications Inc., 2016.
12. Brown TA. Confirmatory Factor Analysis for Applied Research, 2nd edn. New York, NY: The Guilford Press, 2015.
13. Furr M. Scale Construction and Psychometrics for Social and Personality Psychology. Thousand Oaks, CA: SAGE Publications Ltd, 2011.
14. Krosnick JA. Questionnaire design. In: The Palgrave Handbook of Survey Research. Cham, Switzerland: Springer, 2018: 439–55.
15. Marsden PV, Wright JD. Handbook of Survey Research. Bingley, UK: Emerald Group Publishing, 2010.
16. Taber KS. The use of Cronbach's alpha when developing and reporting research instruments in science education. Res Sci Educ 2018; 48: 1273–1296.
17. Davis WR. The FC1 rule of identification for confirmatory factor analysis: a general sufficient condition. Soc Methods Res 1993; 21: 403–437.
18. Malhotra NK, Peterson M. Basic Marketing Research: A Decision-Making Approach. Upper Saddle River, NJ: Pearson/Prentice Hall, 2006.
19. DeCastellarnau A. A classification of response scale characteristics that affect data quality: a literature review. Qual Quant 2018; 52: 1523–1559.
20. Krosnick JA et al. The impact of "no opinion" response options on data quality: non-attitude reduction or an invitation to satisfice? Public Opin Quart 2002; 66: 371–403.
21. Schaeffer NC, Dykema J. Questions for surveys: current trends and future directions. Public Opin Quart 2011; 75: 909–961.
22. Presser S et al. Methods for testing and evaluating survey questions. Public Opin Quart 2004; 68: 109–130.
23. Ginty AT. Psychometric properties. In: Gellman MD, Turner JR, eds. Encyclopedia of Behavioral Medicine. New York, NY: Springer, 2013: 1563–1564.
24. Almanasreh E, Moles R, Chen TF. Evaluation of methods used for estimating content validity. Res Soc Admin Pharm 2019; 15: 214–221.
25. Nevo B. Face validity revisited. J Educ Meas 1985; 22: 287–293.
26. Vitoratou S, Pickles A. A note on contemporary psychometrics. J Ment Health 2017; 26: 486–488.
27. Mokkink LB et al. The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: an international Delphi study. Qual Life Res 2010; 19(4): 539–549. https://doi.org/10.1007/s11136-010-9606-8
28. Crutzen R, Peters G-J. Scale quality: alpha is an inadequate estimate and factor-analytic evidence is needed first of all. Health Psychol Rev 2017; 11: 242–247.
29. Dunn T, Baguley T, Brunsden V. From alpha to omega: a practical solution to the pervasive problem of internal consistency estimation. Br J Psychol 2014; 105: 399–412.
30. Peters G-J. The alpha and the omega of scale reliability and validity: why and how to abandon Cronbach's alpha and the route towards more comprehensive assessment of scale quality. Eur Health Psychol 2014; 16: 56–69.
31. Mirzaei A et al. Development of a questionnaire to measure consumers' perceptions of service quality in Australian community pharmacies. Res Soc Admin Pharm 2018; 15: 346–357.
32. Groves RM, Lyberg L. Total survey error: past, present, and future. Public Opin Quart 2010; 74: 849–879.
33. Stoop I, Harrison E. Classification of surveys. In: Handbook of Survey Methodology for the Social Sciences. New York, NY: Springer, 2012: 7–21.
34. Fowler FJ Jr. Survey Research Methods. Thousand Oaks, CA: Sage Publications, 2013.
35. Holdford D, Schulz R. Effect of technical and functional quality on patient perceptions of pharmaceutical service quality. Pharm Res 1999; 16: 1344–1351.
36. Grew B et al. Validation of a service quality questionnaire in community pharmacy. Res Soc Admin Pharm 2019; 15(6): 673–681.
37. Goodwin LD. The role of factor analysis in the estimation of construct validity. Measure Phys Educ Exer Sci 1999; 3: 85–100.
38. Kaplan D. Structural Equation Modeling: Foundations and Extensions. Thousand Oaks, CA: Sage Publications, 2008.
39. Hair JF et al. Multivariate Data Analysis. New Jersey: Pearson Prentice Hall, 2006.
40. O'Malley A, Neelon B. Latent factor and latent class models to accommodate heterogeneity, using structural equation. In: Culyer AJ, ed. Encyclopedia of Health Economics, 1st edn. Amsterdam, Netherlands: Elsevier, 2014: 131–40.
41. Tabachnick BG, Fidell LS. Using Multivariate Statistics. Boston: Pearson Education, 2013.
42. Kolenikov S, Bollen KA. Testing negative error variances: is a Heywood case a symptom of misspecification? Soc Methods Res 2012; 41: 124–167.
43. Jarvis C, MacKenzie S, Podsakoff P. A critical review of construct indicators and measurement model misspecification in marketing and consumer research. J Consumer Res 2003; 30: 199–218.
44. Hair JF et al. Partial least squares structural equation modeling (PLS-SEM): an emerging tool in business research. Eur Business Rev 2014; 26: 106–121.
45. Costello AB, Osborne JW. Best practices in exploratory factor analysis: four recommendations for getting the most from your analysis. Pract Assess Res Eval 2005; 10: 1–9.
46. Kasper D, Unlü A. On the relevance of assumptions associated with classical factor analytic approaches. Front Psychol 2013; 4: 109.
47. Nunnally JO. Psychometric Theory. New York: McGraw-Hill, 1978.
48. Yong AG, Pearce P. A beginner's guide to factor analysis: focusing on exploratory factor analysis. Tutorials Quant Methods Psychol 2013; 9: 79–94.
49. Pallant JF. SPSS Survival Manual: A Step by Step Guide to Data Analysis Using IBM SPSS. Sydney: Allen & Unwin, 2016.
50. Kaiser HF. An index of factorial simplicity. Psychometrika 1974; 39: 31–36.
51. Kaiser HF. A second generation little jiffy. Psychometrika 1970; 35: 401–415.
52. Bartlett MS. A note on the multiplying factors for various chi square approximations. J Royal Stat Soc Series B 1954; 16: 296–298.
53. Osborne JW. What is rotating in exploratory factor analysis? Pract Assess Res Eval 2015; 20: 1–7.
54. Kaiser HF. The application of electronic computers to factor analysis. Educ Psychol Measure 1960; 20: 141–151.
55. Cattell RB. The scree test for the number of factors. Multivariate Behav Res 1966; 1: 245–276.
56. van der Eijk C, Rose J. Risky business: factor analysis of survey data - assessing the probability of incorrect dimensionalisation. PLoS ONE 2015; 10: e0118900.
57. O'Connor B. SPSS and SAS programs for determining the number of components using parallel analysis and Velicer's MAP test. 2000. https://people.ok.ubc.ca/brioconn/nfactors/nfactors.html (accessed 6 December 2019).
58. Ferguson E, Cox T. Exploratory factor analysis: a users' guide. Int J Select Assess 1993; 1: 84–94.
59. Preacher KJ, MacCallum RC. Repairing Tom Swift's electric factor analysis machine. Understand Stat 2003; 2: 13–43.
60. Schreiber JB. Core reporting practices in structural equation modeling. Res Soc Admin Pharm 2008; 4: 83–97.
61. Schreiber JB. Update to core reporting practices in structural equation modeling. Res Soc Admin Pharm 2017; 13: 634–643.
62. Wolf EJ et al. Sample size requirements for structural equation models: an evaluation of power, bias, and solution propriety. Educ Psychol Measure 2013; 76: 913–934.
63. Hair JF, Babin BJ, Krey N. Covariance-based structural equation modeling in the Journal of Advertising: review and recommendations. J Adv 2017; 46: 163–177.
64. Finney SJ, DiStefano C. Nonnormal and categorical data in structural equation modeling, 2nd edn. Charlotte, NC: Information Age Publishing, 2013.
65. Baraldi AN, Enders CK. An introduction to modern missing data analyses. J School Psychol 2010; 48: 5–37.
66. Cham H et al. Full information maximum likelihood estimation for latent variable interactions with incomplete indicators. Multivariate Behav Res 2017; 52: 12–30.
67. Marsh HW, Hau K-T, Wen Z. In search of golden rules: comment on hypothesis-testing approaches to setting cutoff values for fit indexes and dangers in overgeneralizing Hu and Bentler's (1999) findings. Struct Equ Model Multidisciplinary J 2004; 11: 320–341.
68. Hu L-T, Bentler PM. Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Struct Equ Model Multidisciplinary J 1999; 6: 1–55.
69. Fornell C, Larcker DF. Evaluating structural equation models with unobservable variables and measurement error. J Market Res 1981; 18: 39–50.
70. Carter SR. Using confirmatory factor analysis to manage discriminant validity issues in social pharmacy research. Int J Clin Pharm 2016; 38: 731–737.
71. DiStefano C, Zhu M, Mîndrilă D. Understanding and using factor scores: considerations for the applied researcher. Pract Assess Res Eval 2009; 14: 1–11. Available at: http://pareonline.net/getvn.asp?v=14&n=20 (accessed 6 August 2019).
72. Rodriguez A, Reise SP, Haviland MG. Evaluating bifactor models: calculating and interpreting statistical indices. Psychol Methods 2016; 21: 137–150.
73. Iani L, Lauriola M, Costantini M. A confirmatory bifactor analysis of the hospital anxiety and depression scale in an Italian community sample. Health Qual Life Outcomes 2014; 12: 84.
74. Anderson JC, Gerbing DW. Structural equation modeling in practice: a review and recommended two-step approach. Psychol Bull 1988; 103: 411–423.
75. Wu AD, Li Z, Zumbo BD. Decoding the meaning of factorial invariance and updating the practice of multi-group confirmatory factor analysis: a demonstration with TIMSS data. Pract Assess Res Eval 2007; 12. https://scholarworks.umass.edu/pare/vol12/iss1/3 (accessed 7 January 2020).

© 2020 Royal Pharmaceutical Society. This article is published and distributed under the terms of the Oxford University Press Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model).

International Journal of Pharmacy Practice 2020; 28(4): 326–336. doi: 10.1111/ijpp.12600