The Influence of Test Characteristics on the Detection of Aberrant Response Patterns
Reise, Steven P.; Due, Allan M.
doi: 10.1177/014662169101500301 pmid: N/A
Statistical methods to assess the congruence between an item response pattern and a specified item response theory model have recently proliferated. This "person fit" research has focused on the question: To what extent can person-fit indices identify well-defined forms of aberrant item response? This study extended previous person-fit research in two ways. First, an unexplored model for generating aberrant response patterns was explicated. The data-generation model is based on the theory that aberrant item responses result in less psychometric information for the individual than predicted by the parameters of a specified response model. Second, the proposed response aberrancy generation model was implemented to investigate how the aberrancy detection power of a person-fit statistic is influenced by test properties (e.g., the spread of item difficulties). Results indicated that detecting aberrant response patterns was especially problematic for tests with fewer than 20 items, and for tests with limited ranges of item difficulty. An applied consequence of these results is that certain types of test designs (e.g., peaked tests) and administration procedures (e.g., adaptive tests) potentially act to limit the detection of aberrant item responses.
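As an illustration of what a person-fit index computes, the following is a minimal sketch of the standardized log-likelihood statistic (commonly written l_z) under a two-parameter logistic model. The abstract does not say which person-fit statistic the study used, so the choice of l_z, the function names, and the item parameters below are all assumptions made for illustration.

```python
import math

def p_2pl(theta, a, b):
    """Probability of a correct response under a two-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def lz_person_fit(responses, theta, a, b):
    """Standardized log-likelihood person-fit statistic (l_z).

    responses: list of 0/1 item scores
    theta:     ability value (assumed known here; in practice it is estimated)
    a, b:      lists of item discriminations and difficulties
    Large negative values indicate a response pattern that is unlikely
    under the model.
    """
    l0 = 0.0        # observed log-likelihood of the pattern
    expected = 0.0  # expectation of l0 under the model
    variance = 0.0  # variance of l0 under the model
    for u, a_i, b_i in zip(responses, a, b):
        p = p_2pl(theta, a_i, b_i)
        q = 1.0 - p
        l0 += u * math.log(p) + (1 - u) * math.log(q)
        expected += p * math.log(p) + q * math.log(q)
        variance += p * q * math.log(p / q) ** 2
    return (l0 - expected) / math.sqrt(variance)

# Hypothetical 8-item test with difficulties spread from -2 to 2.
b = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
a = [1.0] * len(b)
consistent = [1, 1, 1, 1, 0, 0, 0, 0]  # easy items right, hard items wrong
reversed_ = [0, 0, 0, 0, 1, 1, 1, 1]   # easy items wrong, hard items right
lz_consistent = lz_person_fit(consistent, 0.0, a, b)
lz_reversed = lz_person_fit(reversed_, 0.0, a, b)
```

In this sketch the reversed pattern yields a strongly negative value while the consistent pattern does not. If all difficulties were close to the examinee's ability, every response probability would be near 0.5 and the two patterns would be nearly indistinguishable, which is in line with the abstract's point about tests with limited ranges of item difficulty.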
Expert-System Scores for Complex Constructed-Response Quantitative Items: A Study of Convergent Validity
Bennett, Randy Elliot; Sebrechts, Marc M.; Rock, Donald A.
doi: 10.1177/014662169101500302 pmid: N/A
This study investigated the convergent validity of expert-system scores for four mathematical constructed-response item formats. A five-factor model comprising four constructed-response format factors and a Graduate Record Examination (GRE) General Test quantitative factor was posed. Confirmatory factor analysis was used to test the fit of this model and to compare it with several alternatives. The five-factor model fit well, although a solution comprising two highly correlated dimensions (GRE-quantitative and constructed-response) represented the data almost as well. These results extend the meaning of the expert system's constructed-response scores by relating them to a well-established quantitative measure and by indicating that they signify the same underlying proficiency across item formats.
The Effect of Numbers of Experts and Common Items on Cutting Score Equivalents Based on Expert Judgment
Norcini, John; Shea, Judy; Grosso, Louis
doi: 10.1177/014662169101500303 pmid: N/A
The effect of different numbers of experts and common items on the scaling of cutting scores derived by experts' judgments was investigated. Four test forms were created from each of two examinations; each form from the first examination shared a block of items with one form from the second examination. Small groups of experts set standards on each using a modification of Angoff's (1971) method. Cutting score equivalents were estimated for the matched forms using different group sizes and numbers of common items; they were compared with cutting score equivalents based on score equating. Results showed that a reduction in error is associated with using more experts or having more items in common between the two forms. For 25 or more common items and five or more judges, the error was about one item on a 100-item test. More than five experts or 25 common items made only a very small difference in error.
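For readers unfamiliar with Angoff-style standard setting, here is a minimal sketch of the basic procedure: each judge estimates, for every item, the probability that a borderline (minimally competent) examinee answers correctly, and the cutting score is the average across judges of each judge's summed estimates. The study used a modification of this method that the abstract does not describe, so the function name and the ratings below are hypothetical and illustrate only the unmodified form.

```python
from statistics import mean

def angoff_cut_score(ratings):
    """Basic Angoff cutting score.

    ratings[j][i] is judge j's estimated probability that a borderline
    examinee answers item i correctly.
    Returns the panel cutting score (in number-correct units) and the
    per-judge totals it was averaged from.
    """
    judge_totals = [sum(judge) for judge in ratings]
    return mean(judge_totals), judge_totals

# Hypothetical ratings from two judges on a two-item test.
ratings = [
    [0.5, 0.5],
    [1.0, 0.5],
]
cut, totals = angoff_cut_score(ratings)
```

The spread of the per-judge totals gives a rough sense of why adding judges reduces error in the panel average, which is the effect the study quantifies.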
Effects of Passage and Item Scrambling on Equating Relationships
Harris, Deborah J.
doi: 10.1177/014662169101500304 pmid: N/A
This study investigated the effects of passage and item scrambling on equipercentile and item response theory equating using a random groups design. For all four tests and for both scramblings used, differences in item and examinee statistics were found to exist between all three forms used (the base form and the two scrambled forms). Up to 50% of the examinees administered a scrambled form would have received a different scale score if the base form equating, rather than the scrambled form equating, had been used to convert their number-correct scores. It is, therefore, suggested that caution be used when scrambled forms are being administered, because in applications such as that studied here, the effects of applying the equating results obtained using a base form to the number-correct scores obtained on a scrambled form can be quite substantial in terms of the numbers of examinees who would receive different scores.
A Comparison of Two Area Measures for Detecting Differential Item Functioning
Kim, Seock-Ho; Cohen, Allan S.
doi: 10.1177/014662169101500307 pmid: N/A
The area between two item response functions is often used as a measure of differential item functioning under item response theory. This area can be measured over either an open interval (i.e., exact) or closed interval. Formulas are presented for computing the closed-interval signed and unsigned areas. Exact and closed-interval measures were estimated on data from a test with embedded items intentionally constructed to favor one group over another. No real differences in detection of these items were found between exact and closed-interval methods.
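The distinction between signed and unsigned area, and between open and closed intervals, can be sketched numerically. The code below approximates the closed-interval areas between two two-parameter logistic item response functions with the trapezoid rule; it is not the closed-form formulas the article presents, and the model choice, function names, interval bounds, and parameter values are assumptions made for illustration.

```python
import math

def irf_2pl(theta, a, b):
    """Two-parameter logistic item response function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def closed_interval_areas(a1, b1, a2, b2, lo=-4.0, hi=4.0, n=4000):
    """Trapezoid-rule signed and unsigned areas between two IRFs on [lo, hi].

    The signed area lets regions of opposite sign cancel; the unsigned
    area integrates the absolute difference.
    """
    h = (hi - lo) / n
    signed = 0.0
    unsigned = 0.0
    for i in range(n + 1):
        theta = lo + i * h
        d = irf_2pl(theta, a1, b1) - irf_2pl(theta, a2, b2)
        w = 0.5 if i in (0, n) else 1.0  # endpoint weights for the trapezoid rule
        signed += w * d * h
        unsigned += w * abs(d) * h
    return signed, unsigned

# Uniform difference: equal discrimination, difficulties 0.0 and 0.5.
s_closed, u_closed = closed_interval_areas(1.0, 0.0, 1.0, 0.5)
# A very wide interval approximates the open-interval (exact) area.
s_wide, u_wide = closed_interval_areas(1.0, 0.0, 1.0, 0.5, lo=-40.0, hi=40.0)
# Crossing curves: same difficulty, different discrimination.
s_cross, u_cross = closed_interval_areas(1.0, 0.0, 2.0, 0.0)
```

For two-parameter logistic curves the open-interval signed area equals the difference in difficulties, so `s_wide` is close to 0.5, while `s_closed` is slightly smaller because the tails outside [-4, 4] are dropped. In the crossing case the signed area cancels to roughly zero while the unsigned area does not, which is why both measures are reported.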
An Empirical Study of the Effects of Small Datasets and Varying Prior Variances on Item Parameter Estimation in BILOG
Harwell, Michael R.; Janosky, Janine E.
doi: 10.1177/014662169101500308 pmid: N/A
Long-standing difficulties in estimating item parameters in item response theory (IRT) have been addressed recently with the application of Bayesian estimation models. The potential of these methods is enhanced by their availability in the BILOG computer program. This study investigated the ability of BILOG to recover known item parameters under varying conditions. Data were simulated for a two-parameter logistic IRT model under conditions of small numbers of examinees and items, and different variances for the prior distributions of discrimination parameters. The results suggest that for samples of at least 250 examinees and 15 items, BILOG accurately recovers known parameters using the default variance. The quality of the estimation suffers for smaller numbers of examinees under the default variance, and for larger prior variances in general. This raises questions about how practitioners select a prior variance for small numbers of examinees and items.
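The data-generation step of such a recovery study can be sketched as follows. This only simulates dichotomous responses from a two-parameter logistic model with known item parameters; it does not call BILOG or perform any estimation. The sample size of 250 examinees and 15 items echoes the abstract, but the standard normal ability distribution, the specific parameter values, and the function name are assumptions made for illustration.

```python
import math
import random

def simulate_2pl(n_examinees, a, b, seed=0):
    """Simulate 0/1 responses under a two-parameter logistic model.

    a, b: lists of known item discriminations and difficulties.
    Abilities are drawn from a standard normal distribution (an assumption).
    Returns a list of response rows, one per examinee.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_examinees):
        theta = rng.gauss(0.0, 1.0)
        row = []
        for a_i, b_i in zip(a, b):
            p = 1.0 / (1.0 + math.exp(-a_i * (theta - b_i)))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data

# Hypothetical generating parameters: 15 items, difficulties from -2 to 2.
n_items = 15
a_true = [1.0] * n_items
b_true = [-2.0 + 4.0 * i / (n_items - 1) for i in range(n_items)]
data = simulate_2pl(250, a_true, b_true, seed=1)

# Proportion correct per item, a quick sanity check on the simulated data.
p_correct = [sum(row[i] for row in data) / len(data) for i in range(n_items)]
```

In a recovery study, a matrix like `data` would be passed to the estimation program and the resulting estimates compared with `a_true` and `b_true`.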
On the Efficiency of IRT Models When Applied to Different Sampling Designs
Berger, Martijn P. F.
doi: 10.1177/014662169101500310 pmid: N/A
The problem of obtaining designs that result in the greatest precision of the parameter estimates is encountered in at least two situations in which item response theory (IRT) models are used. In so-called two-stage testing procedures, certain designs may be specified that match difficulty levels of test items with abilities of examinees. The advantage of such designs is that the variance of the estimated parameters can be controlled. In situations in which IRT models are applied to different groups, efficient multiple-matrix sampling designs are applicable. The choice of matrix sampling designs will also influence the variance of the estimated parameters. Heuristic arguments are given here to formulate the efficiency of a design in terms of an asymptotic generalized variance criterion, and a comparison is made of the efficiencies of several designs. It is shown that some designs may be found to be most efficient for the one- and two-parameter model, but not necessarily for the three-parameter model.