PSYCHOMETRIKA—VOL. 82, NO. 3, 717–736
SIMULATION-EXTRAPOLATION WITH LATENT HETEROSKEDASTIC ERROR
J. R. Lockwood and Daniel F. McCaffrey
EDUCATIONAL TESTING SERVICE
This article considers the application of the simulation-extrapolation (SIMEX) method for measurement error correction when the error variance is a function of the latent variable being measured. Heteroskedasticity of this form arises in educational and psychological applications with ability estimates from item response theory models. We conclude that there is no simple solution for applying SIMEX that generally will yield consistent estimators in this setting. However, we demonstrate that several approximate SIMEX methods can provide useful estimators, leading to recommendations for analysts dealing with this form of error in settings where SIMEX may be the most practical option.
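For readers unfamiliar with SIMEX, the classic homoskedastic version of the algorithm (not the latent-heteroskedastic variant studied in this article) can be sketched in a few lines: repeatedly add extra simulated error at several inflation levels λ, fit the naive estimator at each level, and extrapolate the fitted trend back to λ = −1, the "no measurement error" point. All numerical values below (true slope 1, error SD 0.6, the λ grid) are illustrative assumptions, not quantities from the article.

```python
import numpy as np

rng = np.random.default_rng(12345)

# Illustrative data: latent covariate x, outcome y, error-prone observation w.
n = 20000
sigma_u = 0.6                                   # known, constant measurement error SD
x = rng.normal(size=n)
y = 1.0 * x + rng.normal(scale=0.5, size=n)     # true slope beta = 1
w = x + rng.normal(scale=sigma_u, size=n)

def ols_slope(w, y):
    """Slope from regressing y on w."""
    return np.cov(w, y)[0, 1] / np.var(w, ddof=1)

# SIMEX step 1: for each lambda, add extra simulated error with variance
# lambda * sigma_u^2 (total error variance (1 + lambda) * sigma_u^2) and
# average the naive slope over B replicates.
lambdas = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
B = 50
naive_at_lambda = [
    np.mean([ols_slope(w + rng.normal(scale=np.sqrt(lam) * sigma_u, size=n), y)
             for _ in range(B)])
    for lam in lambdas
]

# SIMEX step 2: fit an extrapolant (a quadratic here, a common default)
# in lambda and evaluate it at lambda = -1.
simex_slope = np.polyval(np.polyfit(lambdas, naive_at_lambda, deg=2), -1.0)

print(f"naive slope: {ols_slope(w, y):.3f}")    # attenuated toward 0
print(f"SIMEX slope: {simex_slope:.3f}")        # much closer to the true slope 1
```

Because the true extrapolation function is unknown, SIMEX with a quadratic (or rational-linear) extrapolant is only approximately consistent even in this homoskedastic case; the article's focus is the further complication that arises when sigma_u itself depends on the latent x.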
Key words: achievement test scores, measurement error, item response theory, covariate adjustment.
Analysts often want to estimate statistical models involving quantities that are not directly
observable. For example, in applied educational research, it is often necessary to adjust for prior
achievement differences across students with different educational experiences when estimating
the effects of those experiences on student outcomes (Battauz & Bellio, 2011; Lockwood &
McCaffrey, 2014a). The data typically available for such adjustment are results from standardized
assessments, which measure achievement with error. It is well known that ignoring measurement
error can bias parameter estimators from statistical models, necessitating ways to account for
measurement error in testing results when estimating models involving latent achievement.
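The bias alluded to here is easy to exhibit in the simplest case. In a linear regression on an error-prone covariate w = x + u with u independent of x, the naive OLS slope converges not to β but to the attenuated value β·σ²ₓ/(σ²ₓ + σ²ᵤ). The short simulation below checks this limit empirically; the parameter values are illustrative assumptions, not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(7)

n = 100_000
beta, sigma_x, sigma_u = 2.0, 1.0, 0.5
x = rng.normal(scale=sigma_x, size=n)           # latent achievement
y = beta * x + rng.normal(scale=1.0, size=n)    # outcome model
w = x + rng.normal(scale=sigma_u, size=n)       # observed error-prone score

# Naive OLS slope of y on w vs. the theoretical attenuated limit.
naive = np.cov(w, y)[0, 1] / np.var(w, ddof=1)
limit = beta * sigma_x**2 / (sigma_x**2 + sigma_u**2)   # = 1.6 here, not beta = 2
print(naive, limit)
```

Correction methods such as SIMEX aim to undo exactly this attenuation using knowledge of the error variance.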
However, applied educational researchers face at least two challenges in dealing with test
measurement error appropriately. The first is that it is commonplace for databases available for
secondary analysis to provide only coarsened information about test performance. Test items
are often analyzed with item response theory models (IRT; Lord, 1980, 1984; van der Linden
& Hambleton, 1997), which are then used to compute scale scores (i.e., ability estimates) for
individual examinees that are reported to students, parents, and educators. It is common for
databases to contain only these scale scores and their “conditional standard errors of measurement”
(CSEMs; Lord, 1980), rather than the item-level response data themselves. As such, the majority
of research and policy analyses with longitudinal testing data in the United States are performed
with IRT scores rather than item responses. For example, consequential performance measures
are computed for many teachers and school leaders throughout the USA each year, and we are
not aware of any cases in which item-level response data are used directly in these computations.
Electronic supplementary material The online version of this article (doi:10.1007/s11336-017-9556-y) contains
supplementary material, which is available to authorized users.
The research reported here was supported in part by the Institute of Education Sciences, US Department of Education,
through Grant R305D140032 to ETS. The opinions expressed are those of the authors and do not represent views of the
Institute or the US Department of Education. We thank Shelby Haberman, Hongwen Guo, Rebecca Zwick, the Editor, an
Associate Editor, and three anonymous referees for helpful comments on earlier drafts.
Correspondence should be made to J. R. Lockwood, Educational Testing Service, Princeton, NJ, USA. Email:
© 2017 The Psychometric Society