PSYCHOMETRIKA—VOL. 83, NO. 1, 132–155
SCORE-BASED TESTS OF DIFFERENTIAL ITEM FUNCTIONING VIA PAIRWISE
MAXIMUM LIKELIHOOD ESTIMATION
Ting Wang
UNIVERSITY OF MISSOURI
UNIVERSITY OF ZURICH
Edgar C. Merkle
UNIVERSITY OF MISSOURI
Measurement invariance is a fundamental assumption in item response theory models, where the
relationship between a latent construct (ability) and observed item responses is of interest. Violation of
this assumption can lead to misinterpretation of the scale or to systematic bias against certain groups of
persons. While a number of methods have been proposed to detect violations of measurement invariance,
they typically require advance definition of problematic item parameters and respondent grouping information.
However, these pieces of information are often unknown in practice. As an alternative, this paper
focuses on a family of recently proposed tests based on stochastic processes of casewise derivatives of the
likelihood function (i.e., scores). These score-based tests require estimation only of the null model (in which
measurement invariance is assumed to hold), and they have previously been applied in factor-analytic,
continuous data contexts as well as in models of the Rasch family. In this paper, we aim to extend these
tests to two-parameter item response models, with a strong emphasis on pairwise maximum likelihood estimation. The
tests’ theoretical background and implementation are detailed, and the tests’ abilities to identify problematic
item parameters are studied via simulation. An empirical example illustrating the tests’ use in practice is
also provided.
Key words: pairwise maximum likelihood, score-based test, item response theory, differential item functioning
A major topic of study in educational and psychological testing is measurement invariance,
with violation of this assumption being called differential item functioning (DIF) in the item
response literature (see, for example, Millsap, 2012, for a review). If a set of items violates
measurement invariance, then individuals with the same ability (“amount” of the latent variable)
may systematically receive different scale scores. This is problematic because researchers might
conclude that groups differ in ability when, in reality, the differences arise from unfair items.
Supported by National Science Foundation Grants SES-1061334 and 1460719.
Electronic supplementary material The online version of this article (https://doi.org/10.1007/s11336-017-9591-8) contains supplementary material, which is available to authorized users.
Correspondence should be made to Ting Wang, Department of Psychological Sciences, University of Missouri,
Columbia, MO, USA. Email: firstname.lastname@example.org
© 2017 The Psychometric Society