Applied Psychological Measurement


journal article

LitStream Collection

A Bayesian Random Weights Linear Logistic Test Model for Within-Test Practice Effects

2023 Applied Psychological Measurement

doi: 10.1177/01466216231209752; pmid: 37997580

The present paper introduces a random weights linear logistic test model for the measurement of individual differences in operation-specific practice effects within a single administration of a test. The proposed model is an extension of the linear logistic test model of learning developed by Spada (1977) in which the practice effects are considered random effects varying across examinees. A Bayesian framework was used for model estimation and evaluation. A simulation study was conducted to examine the behavior of the model in combination with the Bayesian procedures. The results demonstrated the good performance of the estimation and evaluation methods. Additionally, an empirical study was conducted to illustrate the applicability of the model to real data. The model was applied to a sample of responses from a logical ability test providing evidence of individual differences in operation-specific practice effects.
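The abstract does not state the model equation, so the following is only a minimal sketch of the general idea, assuming a logistic model in which an item's difficulty is the sum of the difficulties of the operations it requires, and each earlier exposure to an operation adds a person-specific practice gain. All names (`lltm_probability`, `simulate_examinee`, `practice_weights`) and the normal distribution of the weights are illustrative assumptions, not the authors' specification.

```python
import math
import random


def lltm_probability(theta, item_ops, op_difficulty, practice_counts, practice_weights):
    """Probability of a correct response (assumed logistic form).

    theta            -- examinee ability
    item_ops         -- operations required by the item
    op_difficulty    -- dict: operation -> difficulty contribution
    practice_counts  -- dict: operation -> times already practised in this test
    practice_weights -- dict: operation -> this examinee's practice effect
    """
    logit = theta
    for op in item_ops:
        logit -= op_difficulty[op]                          # operation makes the item harder
        logit += practice_weights[op] * practice_counts[op]  # practice makes it easier
    return 1.0 / (1.0 + math.exp(-logit))


def simulate_examinee(theta, items, op_difficulty, weight_mean, weight_sd, rng):
    """Simulate one examinee's 0/1 responses to items given in administration order."""
    # Random weights: each examinee draws their own practice effect per operation
    # (normal distribution is an assumption made for this sketch).
    practice_weights = {
        op: rng.gauss(weight_mean[op], weight_sd[op]) for op in op_difficulty
    }
    practice_counts = {op: 0 for op in op_difficulty}
    responses = []
    for item_ops in items:
        p = lltm_probability(theta, item_ops, op_difficulty,
                             practice_counts, practice_weights)
        responses.append(1 if rng.random() < p else 0)
        for op in item_ops:
            practice_counts[op] += 1  # the operation has now been practised once more
    return responses
```

Setting every entry of `weight_sd` to zero gives all examinees the same practice effects, which corresponds to the fixed-effects case that the random weights model is described as extending.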

journal article

LitStream Collection

Controlling the Minimum Item Exposure Rate in Computerized Adaptive Testing: A Two-Stage Sympson–Hetter Procedure

Chao, Hsiu-Yi; Chen, Jyun-Hong

2023 Applied Psychological Measurement

doi: 10.1177/01466216231209756; pmid: 37997579

Computerized adaptive testing (CAT) can improve test efficiency, but it also causes the problem of unbalanced item usage within a pool. The effect of uneven item exposure rates can not only induce a test security problem due to overexposed items but also raise economic concerns about item pool development due to underexposed items. Therefore, this study proposes a two-stage Sympson–Hetter (TSH) method to enhance balanced item pool utilization by simultaneously controlling the minimum and maximum item exposure rates. The TSH method divides CAT into two stages. While the item exposure rates are controlled above a prespecified level (e.g., rmin) in the first stage to increase the exposure rates of the underexposed items, they are controlled below another prespecified level (e.g., rmax) in the second stage to prevent items from overexposure. To reduce the effect on trait estimation, TSH only administers a minimum sufficient number of underexposed items that are generally less discriminating in the first stage of CAT. The simulation study results indicate that the TSH method can effectively improve item pool usage without clearly compromising trait estimation precision in most conditions while maintaining the required level of test security.
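For orientation, the classic single-stage Sympson–Hetter filter that the TSH method builds on can be sketched as follows. This is not the two-stage procedure itself, whose details the abstract does not give. The function name, the `exposure_param` dictionary, and the fallback to the last candidate are assumptions made for this illustration.

```python
import random


def sympson_hetter_select(ranked_items, exposure_param, rng):
    """Pick an item using a Sympson-Hetter style probabilistic filter.

    ranked_items   -- candidate item ids, best (e.g. most informative) first
    exposure_param -- dict: item id -> probability in [0, 1] that the item is
                      administered once it has been selected
    rng            -- random.Random instance
    """
    for item in ranked_items:
        # The selected item is administered only with its control probability;
        # otherwise it is skipped and the next-best candidate is tried.
        if rng.random() < exposure_param[item]:
            return item
    # Assumed fallback: if every candidate was filtered out, use the last one.
    return ranked_items[-1]
```

Lowering an item's control probability reduces how often it is administered when it is the best candidate, which is how a maximum exposure rate such as rmax is kept in check. Raising the exposure of underused items requires an additional mechanism, which is the role the abstract assigns to the first stage of TSH.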

journal article

Open Access Collection

Using Auxiliary Item Information in the Item Parameter Estimation of a Graded Response Model for a Small to Medium Sample Size: Empirical Versus Hierarchical Bayes Estimation

Naveiras, Matthew; Cho, Sun-Joo

2023 Applied Psychological Measurement

doi: 10.1177/01466216231209758; pmid: 38027461

Marginal maximum likelihood estimation (MMLE) is commonly used for item response theory item parameter estimation. However, sufficiently large sample sizes are not always possible when studying rare populations. In this paper, empirical Bayes and hierarchical Bayes are presented as alternatives to MMLE in small sample sizes, using auxiliary item information to estimate the item parameters of a graded response model with higher accuracy. Empirical Bayes and hierarchical Bayes methods are compared with MMLE to determine under what conditions these Bayes methods can outperform MMLE, and to determine if hierarchical Bayes can act as an acceptable alternative to MMLE in conditions where MMLE is unable to converge. In addition, empirical Bayes and hierarchical Bayes methods are compared to show how hierarchical Bayes can result in estimates of posterior variance with greater accuracy than empirical Bayes by acknowledging the uncertainty of item parameter estimates. The proposed methods were evaluated via a simulation study. Simulation results showed that hierarchical Bayes methods can be acceptable alternatives to MMLE under various testing conditions, and we provide a guideline to indicate which methods would be recommended in different research situations. R functions are provided to implement these proposed methods.
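As a reminder of what the item parameters being estimated represent, here is a minimal sketch of category probabilities under the standard graded response model, assuming the usual logistic form with one discrimination parameter and ordered thresholds per item. It does not implement the empirical or hierarchical Bayes estimators discussed in the abstract, and the function name is hypothetical.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under the graded response model.

    theta      -- latent trait value
    a          -- item discrimination
    thresholds -- ordered threshold parameters b_1 <= ... <= b_{K-1}
    Returns a list of K probabilities, one per response category.
    """
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be in non-decreasing order")
    # Cumulative probabilities P(X >= k): 1 for the lowest category,
    # a logistic curve per threshold, and 0 beyond the highest category.
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    # Each category probability is the difference of adjacent cumulative values.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]
```

With K response categories an item carries one discrimination and K - 1 thresholds, so the number of parameters grows quickly with test length, which is one reason small samples make estimation difficult.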

journal article

Open Access Collection

Efficiency Analysis of Item Response Theory Kernel Equating for Mixed-Format Tests

Wallmark, Joakim; Josefsson, Maria; Wiberg, Marie

2023 Applied Psychological Measurement

doi: 10.1177/01466216231209757; pmid: 38027462

This study aims to evaluate the performance of Item Response Theory (IRT) kernel equating in the context of mixed-format tests by comparing it to IRT observed score equating and kernel equating with log-linear presmoothing. Comparisons were made through both simulations and real data applications, under both equivalent groups (EG) and non-equivalent groups with anchor test (NEAT) sampling designs. To prevent bias towards IRT methods, data were simulated with and without the use of IRT models. The results suggest that the difference between IRT kernel equating and IRT observed score equating is minimal, both in terms of the equated scores and their standard errors. The application of IRT models for presmoothing yielded smaller standard error of equating than the log-linear presmoothing approach. When test data were generated using IRT models, IRT-based methods proved less biased than log-linear kernel equating. However, when data were simulated without IRT models, log-linear kernel equating showed less bias. Overall, IRT kernel equating shows great promise when equating mixed-format tests.
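The core step shared by the kernel equating variants compared here can be sketched as follows, assuming Gaussian kernel continuization of the discrete score distributions and an equivalent groups design. Where the score probabilities come from (an IRT model or log-linear presmoothing) is outside this sketch, and the function names, fixed bandwidths, and bisection search are assumptions made for illustration.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def kernel_cdf(x, scores, probs, h):
    """Gaussian-kernel continuized CDF of a discrete score distribution.

    scores -- possible score values
    probs  -- (presmoothed) probability of each score, summing to 1
    h      -- bandwidth, h > 0
    """
    mu = sum(s * p for s, p in zip(scores, probs))
    var = sum(p * (s - mu) ** 2 for s, p in zip(scores, probs))
    # Shrinkage factor that keeps the mean and variance of the discrete scores.
    a = math.sqrt(var / (var + h * h))
    return sum(
        p * normal_cdf((x - a * s - (1.0 - a) * mu) / (a * h))
        for s, p in zip(scores, probs)
    )


def kernel_equate(x, scores_x, probs_x, scores_y, probs_y, h_x=0.6, h_y=0.6):
    """Map score x on form X to the form Y scale by matching continuized CDFs."""
    target = kernel_cdf(x, scores_x, probs_x, h_x)
    # Invert the form Y CDF by bisection (it is monotone increasing).
    lo, hi = min(scores_y) - 10.0, max(scores_y) + 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if kernel_cdf(mid, scores_y, probs_y, h_y) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

In practice the bandwidths are usually chosen by a data-driven criterion rather than fixed, and the NEAT design requires extra steps to handle the anchor test. Both are omitted here.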

journal article

LitStream Collection

Two Statistics for Measuring the Score Comparability of Computerized Adaptive Tests

Wyse, Adam E.

2023 Applied Psychological Measurement

doi: 10.1177/01466216231209749; pmid: 37997578

This study introduces two new statistics for measuring the score comparability of computerized adaptive tests (CATs) based on comparing conditional standard errors of measurement (CSEMs) for examinees that achieved the same scale scores. One statistic is designed to evaluate score comparability of alternate CAT forms for individual scale scores, while the other statistic is designed to evaluate the overall score comparability of alternate CAT forms. The effectiveness of the new statistics is illustrated using data from grade 3 through 8 reading and math CATs. Results suggest that both CATs demonstrated reasonably high levels of score comparability, that score comparability was less at very high or low scores where few students score, and that using random samples with fewer students per grade did not have a big impact on score comparability. Results also suggested that score comparability was sometimes higher when the bottom 20% of scorers were used to calculate overall score comparability compared to all students. Additional discussion related to applying the statistics in different contexts is provided.
