Methodology Review: Principles, Procedures, and Findings in the Application of Background Data Measures
Mumford, Michael D.; Owens, William A.
doi: 10.1177/014662168701100101
pmid: N/A
This paper provides a review and critique of the background data literature. As a life history measure, the effective application of background data items is based on a developmental strategy in which a pattern of prior behavior and experiences is related to certain forms of criterion performance. This principle provides a framework for discussing the various issues involved in generating an adequate pool of background data items. The four principal methods for scaling background data items are examined: rational scaling, factorial scaling, empirical keying, and subgrouping. The relative strengths and weaknesses of these four techniques are considered along with current research needs in each area. This review indicates that substantial progress has been made in the development and application of background data measures, but that alternatives to the traditional empirical keying strategy should receive more attention.
A Procedure for Standardizing Individually Administered Tests, Normed by Age or Grade Level
Angoff, William H.; Robertson, Gary J.
doi: 10.1177/014662168701100102
pmid: N/A
This paper describes a method for standardizing individually administered tests, which, for practical reasons, are ordinarily normed on small samples. Typically, these samples yield irregular distributions at each age level, and irregular trends in means and standard deviations across age levels. To ameliorate this situation, a procedure is presented that uses the fitted progression of data across the age levels to develop a normalized aggregate distribution of all available cases, with appropriate corrections for the moments of the individual distributions. This aggregate distribution is used to represent the norms at every age level, after adjustment for differences in level and dispersion. The procedure produces norms with improved stability and comparability, and it yields a smooth, lawful progression of scores from one age to the next. To the extent that the samples of children at every age level represent the same cohort, except for differences in level and dispersion of scores associated with age changes, these data will closely approximate a set of longitudinal norms, but captured at one point in time.
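The general idea of pooling age groups after adjusting for level and dispersion can be illustrated with a rough sketch. This is not the authors' procedure: the straight-line smoothing of means and standard deviations, the mid-rank normalization, the data layout, and all function names (`linear_fit`, `build_norms`, `normalized_score`) are assumptions made for illustration only.

```python
from bisect import bisect_left, bisect_right
from statistics import NormalDist, mean, stdev


def linear_fit(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (intercept, slope)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def build_norms(samples):
    """samples: dict mapping age -> list of raw scores (hypothetical layout).

    Smooths the age-level means and SDs with a straight line (an assumed
    stand-in for the fitted progression), standardizes every case against
    its age's fitted values, and pools all cases into one sorted list.
    """
    ages = sorted(samples)
    m0, m1 = linear_fit(ages, [mean(samples[a]) for a in ages])
    s0, s1 = linear_fit(ages, [stdev(samples[a]) for a in ages])
    fitted = {a: (m0 + m1 * a, s0 + s1 * a) for a in ages}
    pooled = sorted(
        (x - fitted[a][0]) / fitted[a][1] for a in ages for x in samples[a]
    )
    return fitted, pooled


def normalized_score(raw, age, fitted, pooled):
    """Normal deviate for a raw score, read from the pooled distribution."""
    level, spread = fitted[age]
    z = (raw - level) / spread
    n = len(pooled)
    # Mid-rank percentile of z among all pooled cases, kept strictly inside (0, 1).
    mid_rank = (bisect_left(pooled, z) + bisect_right(pooled, z)) / 2
    p = min(max(mid_rank / n, 0.5 / n), 1 - 0.5 / n)
    return NormalDist().inv_cdf(p)


# Made-up scores for three age levels, five children each.
samples = {
    6: [8, 10, 12, 9, 11],
    7: [11, 13, 15, 12, 14],
    8: [14, 16, 18, 15, 17],
}
fitted, pooled = build_norms(samples)
```

Because every age level reads its norms from the same pooled distribution, a small sample at one age borrows stability from the others. The sketch assumes the fitted standard deviation stays positive at every age.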
A Monte Carlo Investigation of Several Person and Item Fit Statistics for Item Response Models
Rogers, H. Jane; Hattie, John A.
doi: 10.1177/014662168701100103
pmid: N/A
This study investigated the behavior of several person and item fit statistics commonly used to test and obtain fit to the one-parameter item response model. Using simulated data for 500 persons and 15 items, the sensitivity of the total-t, mean-square residual, and between-t fit statistics to guessing, heterogeneity in discrimination parameters, and multidimensionality was examined. Additionally, 25 misfitting persons and a misfitting item were generated to test the power of the three fit statistics to detect deviations in a subset of observations. Neither the total-t nor the mean-square residual were able to detect deviation from any of the models fitted. Use of these statistics appears to be unwarranted. The between-t was a useful indicator of guessing and heterogeneity in discrimination parameters, but was unable to detect multidimensionality. These results show that use of person and item fit statistics to test and obtain overall fit to the one-parameter model can lead to acceptance of the model even when it is grossly inappropriate. Assessments of model fit based on this strategy are inadequate. Alternative methods must be sought.
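A minimal sketch of the kind of simulation described, assuming a one-parameter logistic model and an unweighted mean-square residual per item. The parameter ranges, the seed, and the use of the generating parameters in place of estimated ones are assumptions; the study's own design and formulas are not given in the abstract.

```python
import math
import random


def rasch_prob(theta, b):
    """Probability of a correct response under the one-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def simulate(thetas, bs, rng):
    """Persons-by-items matrix of 0/1 responses drawn from the model."""
    return [
        [1 if rng.random() < rasch_prob(t, b) else 0 for b in bs]
        for t in thetas
    ]


def item_mean_squares(data, thetas, bs):
    """Unweighted mean of squared standardized residuals for each item."""
    result = []
    for j, b in enumerate(bs):
        total = 0.0
        for i, t in enumerate(thetas):
            p = rasch_prob(t, b)
            total += (data[i][j] - p) ** 2 / (p * (1.0 - p))
        result.append(total / len(thetas))
    return result


rng = random.Random(0)                            # fixed seed for repeatability
thetas = [rng.gauss(0, 1) for _ in range(500)]    # 500 simulated abilities
bs = [-1.5 + 3.0 * j / 14 for j in range(15)]     # 15 evenly spaced difficulties
data = simulate(thetas, bs, rng)
mean_squares = item_mean_squares(data, thetas, bs)
```

When the data fit the model, each squared standardized residual has expectation 1, so the item mean squares should scatter around 1. Checking sensitivity to guessing or unequal discrimination would mean generating `data` from a different model and comparing the resulting values.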
Detecting Inappropriate Test Scores with Optimal and Practical Appropriateness Indices
Drasgow, Fritz; Levine, Michael V.; McLaughlin, Mary E.
doi: 10.1177/014662168701100105
pmid: N/A
Several statistics have been proposed as quantitative indices of the appropriateness of a test score as a measure of ability. Two criteria have been used to evaluate such indices in previous research. The first criterion, standardization, refers to the extent to which the conditional distributions of an index, given ability, are invariant across ability levels. The second criterion, relative power, refers to indices' relative effectiveness for detecting inappropriate test scores. In this paper the effectiveness of nine appropriateness indices is determined in an absolute sense by comparing them to optimal indices; an optimal index is the most powerful index for a particular form of aberrance that can be computed from item responses. Three indices were found to provide nearly optimal rates of detection of very low ability response patterns modified to simulate cheating, as well as very high ability response patterns modified to simulate spuriously low responding. Optimal indices had detection rates from 50% to 200% higher than any other index when average ability response vectors were manipulated to appear spuriously high and spuriously low.
Statistical Inference for Coefficient Alpha
Feldt, Leonard S.; Woodruff, David J.; Salih, Fathi A.
doi: 10.1177/014662168701100107
pmid: N/A
Rigorous comparison of the reliability coefficients of several tests or measurement procedures requires a sampling theory for the coefficients. This paper summarizes the important aspects of the sampling theory for Cronbach's (1951) coefficient alpha, a widely used internal consistency coefficient. This theory enables researchers to test a specific numerical hypothesis about the population alpha and to obtain confidence intervals for the population coefficient. It also permits researchers to test the hypothesis of equality among several coefficients, either under the condition of independent samples or when the same sample has been used for all measurements. The procedures are illustrated numerically, and the assumptions and derivations underlying the theory are discussed.
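As a small illustration of the quantities involved, the sketch below computes coefficient alpha from a persons-by-items score matrix and forms the ratio (1 - alpha_0) / (1 - alpha_hat). The F reference distribution with N - 1 and (N - 1)(k - 1) degrees of freedom is the result commonly attributed to Feldt (1965); it is assumed here rather than taken from the abstract. The function names and the data are made up for illustration.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Coefficient alpha for a list of persons, each a list of k item scores."""
    k = len(scores[0])
    item_variances = [variance([row[j] for row in scores]) for j in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def feldt_ratio(alpha_hat, alpha_null, n_persons, k_items):
    """Ratio (1 - alpha_null) / (1 - alpha_hat) with its assumed F degrees of freedom.

    Undefined when alpha_hat equals 1 exactly.
    """
    w = (1 - alpha_null) / (1 - alpha_hat)
    return w, n_persons - 1, (n_persons - 1) * (k_items - 1)


# Made-up scores: five persons answering three items.
scores = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 1],
    [5, 4, 5],
    [3, 3, 4],
]
alpha_hat = cronbach_alpha(scores)
w, df1, df2 = feldt_ratio(alpha_hat, 0.80, len(scores), len(scores[0]))
```

The ratio `w` would then be compared with critical values of the F distribution on `df1` and `df2` degrees of freedom, taken from a table or a statistics library. The same critical values, multiplied by (1 - alpha_hat) and subtracted from 1, give confidence limits for the population coefficient.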