Faking in High-Stakes Personality Assessments: A Response-Time-Based Latent Response Mixture Modeling Approach
Seitz, Timo; Ulitzsch, Esther
doi: 10.1177/00131644261422169; pmid: 41869540
When personality assessments are employed in high-stakes contexts, there is the risk that test-takers provide overly positive descriptions of themselves. This response bias is known as faking and has often been addressed in latent variable models through an additional dimension capturing each test-taker’s faking degree. Such models typically assume a homogeneous response strategy for all test-takers, with substantive traits and faking jointly influencing responses to all items. In this article, we present a latent response mixture item response theory (IRT) model of faking that accounts for changes in test-takers’ response strategies over the course of the assessment. The model translates theoretical considerations about test-taker behavior into different model components for item responses and corresponding item-level response times (RT), thereby making it possible to account for, identify, and investigate different faking-related response strategies at the person-by-item level. In a parameter recovery study, we found that the model parameters can be estimated well under realistic conditions. We also applied the model to an empirical dataset (N = 1,824) from a job application context, showcasing its utility in real high-stakes assessment data. We conclude the article by discussing the role of the model for psychological measurement as well as substantive research.
Conditional Dependencies Between Response Time and Item Discrimination: An Item-Level Meta-Analysis
Gilbert, Joshua B.; Young, William S.; Himmelsbach, Zachary; Ulitzsch, Esther; Domingue, Benjamin W.
doi: 10.1177/00131644261426972; pmid: 41859484
The use of process data, such as response time (RT) in psychometrics, has generally focused on the relationship between speed and accuracy. The potential relationships between RT and item discrimination remain less explored. In this study, we propose a model for simultaneously estimating the relationships between RT and item discrimination at the person, item, and person-by-item (residual) levels and illustrate our approach through an item-level meta-analysis of 40 empirical data sets comprising 1.84 million item responses. We find no evidence of average differences in item discrimination between items of different time intensity or persons of different average RT, while residual RT strongly and negatively predicts item discrimination (pooled coef. = -.27% per 1% difference in RT, SE = .04, τ = .17). While heterogeneity is high, we find little evidence of moderation by overall data set characteristics. Flexible generalized additive models show that the relationship between residual RT and item discrimination is generally curvilinear, with discrimination maximized just below average RT and minimized at the extremes. Our results suggest that RT data can provide insights into the measurement properties of educational and psychological assessments, but that the relationships between RT and item discrimination are highly variable.
Interactions Between Termination Criteria and Ability Estimators in Computerized Adaptive Testing
Liu, Xinyu; Weiss, David J.
doi: 10.1177/00131644261453945; pmid: N/A
Computerized adaptive testing (CAT) aims to optimize measurement by tailoring item administration to individual examinees. The efficiency and precision of a CAT depend heavily on the choice of ability (θ) estimator and the termination criterion (stopping rule). Prior research suggests these components interact, but comprehensive evaluations across varying item bank characteristics remain limited. This simulation study investigated the interactive effects of four θ estimators (maximum likelihood [MLE], weighted likelihood [WLE], maximum a posteriori [MAP], and expected a posteriori [EAP]) and four termination criteria (fixed-length, standard error of measurement [SEM], minimum information [MI], and change-in-estimate [Δθ]) on measurement bias, precision (RMSE), and test length. These combinations were evaluated across low- (100-item) and high-information (500-item) item banks with both flat and peaked information distributions using the three-parameter logistic model. The results demonstrated that the optimal CAT configuration is contingent on item bank size and shape. Across all conditions, WLE emerged as the most robust estimator, effectively neutralizing the boundary estimation issues of MLE and the shrinkage bias characteristic of Bayesian estimators. In high-information banks, the SEM and fixed-length rules yielded the lowest conditional RMSE and bias regardless of bank shape. However, in low-information peaked banks, the strict SEM rule frequently failed to reach precision targets at the θ extremes, resulting in inefficient, maximum-length tests. Under these sparse conditions, the Δθ rule paired with WLE provided a superior balance of accuracy and efficiency by halting administration when precision gains stagnated. Conversely, the MI rule consistently exhibited the highest bias and RMSE. These findings underscore that optimal CAT design is not a one-size-fits-all solution. For high-quality banks, WLE paired with an SEM or fixed-length rule is recommended. For lower-quality banks, practitioners should adopt a Δθ rule or a hybrid SEM approach to prevent inefficient test elongation.
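As a rough illustration of how such stopping rules interact, the sketch below combines an SEM rule, a change-in-estimate (Δθ) rule, and a maximum test length in a single check. It is not the authors' simulation code: the function names, the target values (`sem_target`, `delta_target`, `max_items`), and the use of the logistic metric without the 1.7 scaling constant are all assumptions made for the example, and the SEM is computed from test information, which matches MLE-style estimation rather than Bayesian posterior standard deviations.

```python
import math

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model (no 1.7 scaling)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """Fisher information of a single 3PL item at ability theta."""
    p = p_3pl(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def sem(theta, administered):
    """Information-based standard error for a list of (a, b, c) item tuples."""
    total = sum(item_information(theta, a, b, c) for a, b, c in administered)
    return float("inf") if total == 0 else 1.0 / math.sqrt(total)

def should_stop(theta_history, administered,
                max_items=30, sem_target=0.30, delta_target=0.02):
    """Return the name of the rule that fired, or None to keep testing.

    All three thresholds are hypothetical values chosen for illustration.
    """
    if len(administered) >= max_items:
        return "max_length"
    if sem(theta_history[-1], administered) <= sem_target:
        return "sem"
    # Change-in-estimate rule: stop when the last update barely moved theta.
    if len(theta_history) >= 2 and \
            abs(theta_history[-1] - theta_history[-2]) < delta_target:
        return "delta_theta"
    return None
```

With four items of modest discrimination, `should_stop([0.5, 0.51], [(1, 0, 0)] * 4)` cannot reach the SEM target, so the Δθ rule is the one that ends the test; this is the kind of situation the abstract describes for low-information banks.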
Misclassification Produced by Rapid-Guessing Identification Methods and Their Suitability Under Various Conditions
Holopainen, Santeri; Metsämuuronen, Jari; Laakso, Mikko-Jussi; Kujala, Janne
doi: 10.1177/00131644261419426; pmid: 41743843
Response Time Threshold Methods (RTTMs) are widely used to identify rapid-guessing behavior (RG) in low-stakes assessments, yet face two key challenges: (a) inevitable misclassifications due to overlapping response time distributions of engaged and disengaged responses, and (b) lack of agreement on which method to use under varying conditions. This simulation study evaluated five RTTMs. Item responses and response times were generated from either a one-component model without RG or a two-component mixture model with RG in the population. Distribution, item, and person parameters were varied. Results showed that when the population contained RG, the mixture lognormal distribution-based method (MLN) was the most robust approach and estimated precise thresholds closest to the time points at which the misclassification rates were minimized, even when bimodality was more difficult to detect. The cumulative proportion method (CUMP) was less robust but also accurate when successful, though less precise. In addition, when the population did not include RG, CUMP was the only method to set thresholds for a notable proportion of cases. The methods were generally more conservative than liberal, though the mixture response time quantile method (MRTQ) was neither. The results are discussed in the light of prior RG research and the methods’ characteristics, and future directions are suggested. Ultimately, for practical settings, we recommend a six-step process for RG identification that utilizes both a mixture modeling approach (MLN or MRTQ) and the CUMP method.
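In a simulation, the true status of each response is known, so the misclassification produced by any candidate threshold can be counted directly. The sketch below shows one way to do that; the function name, the strict `<` comparison, and the choice of denominators are assumptions for illustration and are not taken from the study.

```python
def misclassification_rates(response_times, is_rapid_guess, threshold):
    """Compare threshold-based flags against known (simulated) RG labels.

    A response is flagged as a rapid guess when its RT is strictly below
    the threshold (an assumed convention).
    """
    flagged = [rt < threshold for rt in response_times]
    n_rg = sum(is_rapid_guess)
    n_engaged = len(is_rapid_guess) - n_rg
    # Engaged responses wrongly flagged as rapid guesses.
    false_pos = sum(f and not rg for f, rg in zip(flagged, is_rapid_guess))
    # Rapid guesses that the threshold failed to flag.
    false_neg = sum((not f) and rg for f, rg in zip(flagged, is_rapid_guess))
    return {
        "false_positive_rate": false_pos / n_engaged if n_engaged else 0.0,
        "false_negative_rate": false_neg / n_rg if n_rg else 0.0,
        "overall": (false_pos + false_neg) / len(response_times),
    }
```

Evaluating this function over a grid of thresholds gives the time point at which the overall rate is smallest, which is the kind of benchmark against which estimated thresholds can be compared.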
Identification and Diagnosis of Misreporting in Surveys
Li, Jing; Yang, Xiao; Engelhard, George
doi: 10.1177/00131644261451209; pmid: 42261392
Misreporting and other forms of aberrant responding can undermine the validity of survey-based inferences. Person-level evaluation of aberrant responses is rarely conducted because inspecting individual response patterns is time-intensive. This study proposes an integrated approach for identifying, classifying, and interpreting misfitting response patterns using nonparametric visualizations of person response functions combined with clustering of person response functions. The first step is to calibrate the survey items using an IRT model, such as the Rasch model, to establish an interpretable latent continuum with item-location ordering. Next, person-fit statistics, such as infit and outfit mean square error statistics, are examined, and a smaller subset of response patterns is flagged as misfitting. The third step is to use a nonparametric Hanning procedure to create person response functions, followed by clustering misfitting person response functions using Partitioning Around Medoids (PAM). The advantage of PAM over other clustering methods is that an observed response pattern is identified as a representative case for each cluster. Each cluster can then be given a substantive interpretation, such as underreporting, inconsistent reporting, or overreporting. Finally, decisions can be made about how to address aberrant person response patterns. The Household Food Security Survey Module from the U.S. Census is used as an illustration. These visualizations can support transparent data-quality evaluation with the potential for survey improvements.
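The smoothing step can be sketched as follows. Hanning is a running weighted average with weights 1/4, 1/2, 1/4; how the endpoints are treated and whether the smoother is applied more than once are not stated in the abstract, so the single pass with unchanged endpoints below is an assumption, as are the function names.

```python
def hanning_smooth(values):
    """One pass of a Hanning smoother (weights 1/4, 1/2, 1/4).

    Endpoints are left unchanged, which is an assumed convention.
    """
    if len(values) < 3:
        return [float(v) for v in values]
    smoothed = [float(values[0])]
    for i in range(1, len(values) - 1):
        smoothed.append(
            0.25 * values[i - 1] + 0.5 * values[i] + 0.25 * values[i + 1]
        )
    smoothed.append(float(values[-1]))
    return smoothed

def person_response_function(responses, item_locations):
    """Order one person's responses by item location, then smooth them."""
    order = sorted(range(len(responses)), key=lambda i: item_locations[i])
    ordered = [responses[i] for i in order]
    return hanning_smooth(ordered)
```

A person whose endorsements fall steadily as items become harder to endorse yields a monotone curve, for example `[1.0, 0.75, 0.25, 0.0]`; curves that rise again or stay flat are the candidates for clustering.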
Signposts on the Path From Nominal to Ordinal Scales: Moving From a Discrete to a Continuous View
Nalbandyan, Roza; Gilbert, Joshua B.; Franco, Vithor R.; Domingue, Benjamin W.
doi: 10.1177/00131644261440556; pmid: 42116846
Polytomous item response data are typically classified as either nominal or ordinal, but this binary distinction may oversimplify their true structure. In this paper, we reframe the nominal–ordinal distinction as a continuum and introduce six empirical indices to quantify the degree of category ordering in item response data. Through extensive simulations with various item response theory (IRT) models and applications to 245 empirical datasets, we evaluate the indices’ sensitivity, computational efficiency, and interpretability across diverse measurement contexts. Our findings show that two parametric indices—Mean Difference between Slope Parameters (Index 5) and Arctangent of Paired Category Ratios (Index 6)—are particularly robust and informative, even with low-frequency categories. These indices offer a practical tool for assessing whether and how item categories align with ordinal assumptions, supporting more accurate measurement and model selection. We conclude that treating ordering as a continuum, rather than a binary property, provides deeper insights for psychometric practice and strengthens the connection between empirical response patterns and their theoretical representations.
Assessing the Unconditional and Conditional External Validity of Noncognitive Test Scores: A Unifying Model-Based Proposal
Ferrando, Pere J.; Morales-Vives, Fabia; Duran-Bonavila, Silvia; Navarro-González, David
doi: 10.1177/00131644261440168; pmid: 42080149
Evidence of external validity based on individual score estimates is still relevant in many psychometric applications. From a model-based perspective, however, the topic appears to have been rather neglected in recent decades. In structural equation modelling (SEM), this evidence is typically sought at the structural level, bypassing the scoring stage, whereas in item response theory (IRT), interest in scores mostly focuses on their internal properties. Against this background, this paper develops and proposes a model-based approach, intended for noncognitive measures, that combines SEM and IRT developments and allows a detailed assessment of the external validity of a class of score estimates. The starting point is a general extended model that also includes the relevant external variables. From this general model, four well-known extended IRT models can be derived and fitted at the structural level. Next, on the basis of the structural results, a series of unconditional (population-dependent) and conditional (population-independent) indices that describe the model-implied relation between the score estimates and each external variable are developed and proposed. The practical relevance of the proposal is discussed mainly around three applications: assessing model appropriateness, obtaining point and interval prediction estimates at the individual level, and shortening a test while optimizing the external validity of the resulting version. The functioning of the proposal is illustrated using a real-data example.
Comparing Different Approaches of (Not) Accounting for Rapid Guessing in Plausible Values Estimation
Welling, Jana; Zink, Eva; Gnambs, Timo
doi: 10.1177/00131644251395590; pmid: 41551947
Educational large-scale assessments provide information on ability differences between groups, informing policies and shaping educational decisions. However, some of these differences might partly reflect variations in test-taking motivation rather than in actual abilities. Existing approaches for mitigating the distorting effects of rapid guessing focus mainly on point estimates of abilities, although research questions often refer to latent variables. The present study seeks to (a) determine the bias introduced by rapid guessing in group comparisons based on plausible value estimates and (b) introduce and evaluate different approaches to handling rapid guessing in the estimation of plausible values. In a simulation study, four models were compared: (1) a baseline model that did not account for rapid guessing, (2) a person-level model that incorporated rapid guessing as a respondent characteristic in the background model, (3) a response-level model that filtered responses with item response times lower than a predetermined threshold, and (4) a combined model that merged the person- and response-level approaches. Results show that the response-level and combined models performed best, whereas accounting for rapid guessing on the person level alone did not suffice. An empirical example using data from a German large-scale assessment (N = 478) demonstrates the applicability of all approaches in practice. Recommendations for future research are given to improve ability estimation.
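The two ingredients of the person-level and response-level models can be sketched as simple data-preparation steps that precede plausible value estimation. The code below is a minimal illustration under stated assumptions: the function names are hypothetical, filtered responses are recoded to `None` (missing), and the person-level characteristic is taken to be the proportion of flagged responses, which the abstract does not specify.

```python
def filter_rapid_guesses(responses, response_times, thresholds):
    """Response-level step: recode responses faster than the item threshold
    as missing (None). One threshold per item is assumed."""
    filtered = []
    for person_resp, person_rt in zip(responses, response_times):
        filtered.append([
            None if rt < thr else resp
            for resp, rt, thr in zip(person_resp, person_rt, thresholds)
        ])
    return filtered

def rapid_guess_proportion(response_times, thresholds):
    """Person-level step: share of each person's responses flagged as rapid
    guesses, usable as a covariate in a background model (assumed form)."""
    return [
        sum(rt < thr for rt, thr in zip(person_rt, thresholds)) / len(person_rt)
        for person_rt in response_times
    ]
```

A combined approach in this sketch would pass the filtered response matrix to the measurement model and the flagged proportion to the background model at the same time.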
Estimating Trends With Differential Item Functioning: A Comparison of Five IRT-Based Approaches
Engels, Oskar; Lüdtke, Oliver; Robitzsch, Alexander
doi: 10.1177/00131644251408818; pmid: 41835215
In longitudinal assessments, tests are frequently used to estimate trends over time. However, when item parameters lack invariance, time-point comparisons can be distorted, necessitating appropriate statistical methods to achieve accurate estimation. This study compares trend estimates using the two-parameter logistic (2PL) model under item parameter drift (IPD) across five trend-estimation approaches for two time points: First, concurrent calibration, which jointly estimates item parameters across multiple time points. Second, fixed calibration, which estimates item parameters at a single time point and fixes them at the other time point. Third, robust linking with Haberman and Haebara as linking methods with Lp or L0 losses. Fourth, non-invariant items are detected using likelihood-ratio tests or the root mean square deviation statistic with fixed or data-driven cutoffs, and trend estimates are then recomputed using only the detected invariant items under partial invariance. Fifth, regularized estimation under a smooth Bayesian information criterion (SBIC) is applied, shrinking small or null IPD effects toward zero while estimating all others as nonzero. Bias and relative root mean square error (RMSE) were evaluated for the mean and SD at T2. An empirical example using synthetic longitudinal reading data, applying the trend-estimation approaches, is provided. The results indicate that the regularized estimation with SBIC performed best across conditions, maintaining low bias and RMSE, followed by robust linking methods. Specifically, Haberman linking with the L0 loss function showed superior performance under unbalanced IPD, outperforming the partial invariance approaches. Concurrent and fixed calibration showed the poorest trend recovery under unbalanced IPD conditions.
Beyond One-Size-Fits-All: A Differential Sensitivity Framework for Machine Learning–Based Detection of Anomalous Survey Responses
Ding, Cody
doi: 10.1177/00131644261448404; pmid: 42238019
Anomalous survey responses, including random, careless, extreme, acquiescent, straightline, and alternating responding, threaten the validity of survey-based research. Machine learning (ML) algorithms offer flexible, model-agnostic alternatives to traditional detection methods, yet their relative effectiveness across anomaly types remains poorly understood. This study evaluated 11 unsupervised anomaly detection algorithms spanning four paradigms (distance-based, density-based, reconstruction-based, and tree/boundary-based) against six simulated anomaly types embedded in a realistic survey dataset (N = 3,000). Results revealed pronounced differential sensitivity: globally deviant patterns (random, extreme, alternating) were universally detectable, whereas careless and acquiescent responding required reconstruction- or boundary-based methods, and straightline responding resisted detection by all algorithms (maximum area under the receiver operating characteristic curve [AUC-ROC] < .70). No single algorithm dominated across all types. These findings argue for multimethod approaches combining ML algorithms with traditional response quality indicators, and provide a framework for selecting detection methods based on anticipated anomaly types.
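For readers unfamiliar with the evaluation metric, AUC-ROC can be read as the probability that a randomly chosen anomalous respondent receives a higher anomaly score than a randomly chosen regular respondent, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name and the convention that higher scores mean "more anomalous" are assumptions, and the quadratic-time loop is chosen for clarity rather than speed.

```python
def auc_roc(scores, is_anomalous):
    """AUC-ROC via pairwise comparison of anomaly scores.

    Assumes higher scores indicate more anomalous respondents.
    """
    pos = [s for s, y in zip(scores, is_anomalous) if y]
    neg = [s for s, y in zip(scores, is_anomalous) if not y]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one regular case")
    # Each (anomalous, regular) pair scores 1 for a win and 0.5 for a tie.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))
```

On this scale, .50 corresponds to chance-level ranking, which gives context for the reported maximum below .70 for straightline responding.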