Sensemaking of Process Data from Evaluation Studies of Educational Games: An Application of Cross‐Classified Item Response Theory Modeling
Feng, Tianying; Cai, Li
doi: 10.1111/jedm.12396; pmid: N/A
Process information collected from educational games can illuminate how students approach interactive tasks, complementing assessment outcomes routinely examined in evaluation studies. However, the two sources of information are historically analyzed and interpreted separately, and diagnostic process information is often underused. To tackle these issues, we present a new application of cross‐classified item response theory modeling, using indicators of knowledge misconceptions and item‐level assessment data collected from a multisite game‐based randomized controlled trial. This application addresses (a) the joint modeling of students' pretest and posttest item responses and game‐based processes described by indicators of misconceptions; (b) integration of gameplay information when gauging the intervention effect of an educational game; (c) relationships among game‐based misconception, pretest initial status, and pre‐to‐post change; and (d) nesting of students within schools, a common aspect in multisite research. We also demonstrate how to structure the data and set up the model to enable our proposed application, and how our application compares to three other approaches to analyzing gameplay and assessment data. Lastly, we note the implications for future evaluation studies and for using analytic results to inform learning and instruction.
Addressing Bias in Spoken Language Systems Used in the Development and Implementation of Automated Child Language‐Based Assessment
Bailey, Alison L.; Johnson, Alexander; Shankar, Natarajan Balaji; Veeramani, Hariram; Washington, Julie A.; Alwan, Abeer
doi: 10.1111/jedm.12435; pmid: N/A
This article addresses bias in Spoken Language Systems (SLS) that involve both Automatic Speech Recognition (ASR) and Natural Language Processing (NLP) and reports experiments to improve the performance of SLS for automated language and literacy‐related assessments with students who are underserved in the U.S. educational system. We frame bias in SLS in terms of testing fairness and validity, stemming in part from the exclusion of sufficiently large training datasets in varieties of English other than General American English (GAE). We adopt an Interpretation/Use Argument approach to validity focused on clarity of constructs and scoring accuracy. While SLS use ASR to automatically transcribe students’ utterances, and apply NLP algorithms to ASR transcripts to measure students’ speech samples, it is well‐documented in studies with adults that ASR is typically more problematic for African American English (AAE) speakers than for other groups due to differences in prosody, pronunciation, word usage, and grammar. We utilized child speech and text corpora to improve algorithms that score oral task responses for child AAE speakers and, in some experiments, children with oral language and reading difficulties. Favorable results provide impetus and possible solutions for fair and inclusive assessments for diverse student groups in the future.
Sequential Reservoir Computing for Log File‐Based Behavior Process Data Analyses
Xiong, Jiawei; Wang, Shiyu; Tang, Cheng; Liu, Qidi; Sheng, Rufei; Wang, Bowen; Kuang, Huan; Cohen, Allan S.; Xiong, Xinhui
doi: 10.1111/jedm.12413; pmid: N/A
The use of process data in assessment has gained attention in recent years as more assessments are administered by computers. Process data, recorded in computer log files, capture the sequence of examinees' response activities, for example, timestamped keystrokes, during the assessment. Traditional measurement methods are often inadequate for handling this type of data. In this paper, we propose a sequential reservoir method (SRM) based on a reservoir computing model using the echo state network, with particle swarm optimization and singular value decomposition used for optimization. Designed to regularize features from process data through a computational self‐learning algorithm, this method has been evaluated using both simulated and empirical data. Simulation results suggest that, on the one hand, the model effectively transforms action sequences into standardized and meaningful features, and on the other hand, these features are instrumental in categorizing latent behavioral groups and predicting latent information. Empirical results further indicate that SRM can predict assessment efficiency. The features extracted by SRM have been verified as related to action sequence lengths through correlation analysis. The proposed method enhances the extraction and accessibility of meaningful information from process data, presenting an alternative to existing process data technologies.
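As a rough illustration of the reservoir computing idea behind this abstract, the sketch below passes action sequences of different lengths through a small, fixed echo state network and keeps the final reservoir state as a fixed-length feature vector. It is not the authors' SRM: it omits particle swarm optimization and singular value decomposition, and the names `make_reservoir` and `encode_sequence`, the reservoir size, and the row-norm scaling are all assumptions made for the example.

```python
import math
import random


def make_reservoir(n_actions, n_units=20, row_norm=0.9, seed=0):
    """Build random, untrained input and recurrent weights (hypothetical helper)."""
    rng = random.Random(seed)
    # Input weights: one column per action type (actions are one-hot coded).
    w_in = [[rng.uniform(-1, 1) for _ in range(n_actions)] for _ in range(n_units)]
    w = []
    for _ in range(n_units):
        row = [rng.uniform(-1, 1) for _ in range(n_units)]
        # Scale each row so its absolute sum is row_norm (< 1). This bounds the
        # spectral radius by row_norm, a simple way to keep the state stable.
        scale = row_norm / sum(abs(v) for v in row)
        w.append([v * scale for v in row])
    return w_in, w


def encode_sequence(actions, w_in, w):
    """Run a sequence of action indices through the reservoir; return the final state."""
    n_units = len(w)
    state = [0.0] * n_units
    for a in actions:
        # With one-hot input, W_in @ u reduces to picking column `a` of W_in.
        state = [
            math.tanh(w_in[i][a] + sum(w[i][j] * state[j] for j in range(n_units)))
            for i in range(n_units)
        ]
    return state


# Two log-file action sequences of different lengths (indices are made up).
w_in, w = make_reservoir(n_actions=3)
features_short = encode_sequence([0, 1], w_in, w)
features_long = encode_sequence([0, 1, 2, 2, 1, 0, 1], w_in, w)
```

Both feature vectors have the same length regardless of sequence length, which is what makes them usable as inputs to a downstream classifier or regression.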
Simultaneous Detection of Compromised Items and Examinees with Item Preknowledge in Online Assessments Using Response Time Data
Zopluoglu, Cengiz
doi: 10.1111/jedm.70030; pmid: N/A
The rapid transition from traditional paper‐and‐pencil tests to computer‐based testing systems has significantly altered the educational landscape, particularly during the COVID‐19 pandemic. While online assessments offer numerous advantages, they also present unique challenges, with test security being paramount. This article addresses the critical issue of test fraud in digital assessments, specifically focusing on item preknowledge, where examinees have prior access to test items. Using response‐time data, we propose a statistical framework for simultaneously identifying compromised items and examinees with item preknowledge in a single‐step analysis. Unlike existing methods, our model does not require prior knowledge about the compromised status of items. Using a large‐scale online certification exam dataset, we demonstrate the model's application in detecting significant signals in response times, identifying potentially compromised items, and examinees with potential item preknowledge.
A Group Fit Statistic for the Multilevel Item Response Model
Ding, Yishan; Yang, Ji Seung; Han, Youngjin
doi: 10.1111/jedm.70024; pmid: N/A
Aberrant behaviors among test‐takers in large‐scale assessments are often more prevalent within specific groups or testing sites. While various techniques have been developed to detect individual‐level test‐takers' aberrant behaviors, research in detecting those behaviors at the group level is rare. We propose a group fit statistic $l_{z2}$ by extending the $l_z$ statistic to a multilevel item response model. This new statistic demonstrates adequate power and effectively controls the Type I error rate, particularly when true latent variable values are used or when group sizes are large, such as 500. When latent variable estimates are employed, an adjustment to $l_{z2}$ based on the posterior predictive checking approach can offer improved control over the Type I error rate.
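For readers unfamiliar with the individual-level $l_z$ statistic that this abstract builds on, the sketch below computes the standardized log-likelihood person-fit statistic for dichotomous items. It shows only the classic single-level $l_z$, not the group-level $l_{z2}$ proposed in the article, and the function name and example probabilities are assumptions.

```python
import math


def lz_statistic(responses, probs):
    """Standardized log-likelihood person-fit statistic for 0/1 responses.

    `probs` holds the model-implied probability of a correct response to each
    item for this person (for example, from a fitted IRT model). Each value
    must lie strictly between 0 and 1, and not all may equal 0.5.
    """
    # Observed log-likelihood of the response pattern.
    l0 = sum(
        u * math.log(p) + (1 - u) * math.log(1 - p)
        for u, p in zip(responses, probs)
    )
    # Expected value and variance of l0 under the model.
    expected = sum(p * math.log(p) + (1 - p) * math.log(1 - p) for p in probs)
    variance = sum(p * (1 - p) * math.log(p / (1 - p)) ** 2 for p in probs)
    return (l0 - expected) / math.sqrt(variance)


# Made-up probabilities: two easy items, two hard items.
probs = [0.9, 0.8, 0.3, 0.2]
lz_consistent = lz_statistic([1, 1, 0, 0], probs)  # right on easy, wrong on hard
lz_aberrant = lz_statistic([0, 0, 1, 1], probs)    # wrong on easy, right on hard
```

Large negative values flag response patterns that are unlikely under the model; here the aberrant pattern yields a negative value and the consistent pattern a positive one.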
AI and Measurement Concerns: Dealing with Imbalanced Data in Autoscoring
Liu, Yunting; Xiang, Yijun; Feng, Xutao; Wilson, Mark
doi: 10.1111/jedm.70031; pmid: N/A
Unbiasedness for proficiency estimates is important for autoscoring engines since the outcome might be used for future learning or placement. Imbalanced training data may lead to certain biases and lower the prediction accuracy for classification algorithms. In this article, we investigated several data augmentation methods to lower the negative effect of imbalanced data in measurement settings. Four approaches were examined: (1) resampling methods, either oversampling or undersampling; (2) active resampling methods, where the resampling weight is based on representativeness in the training set; (3) data expansion methods using synonym replacement, slightly changing the meaning or semantics of the original answers; and (4) content recreation methods using Generative AI (e.g., ChatGPT) to create responses for less populated scores. We compared the performance (e.g., Accuracy, QWK, F1) as well as the distance metric for different combinations of the methods. Two datasets with different imbalanced distributions were used. Results show that all four methods can help to mitigate the bias issue and that their efficacy was influenced by the imbalance level, the representativeness of the original data, and the level of increment in the variety of the responses (i.e., lexical diversity). In general, resampling and GenAI with active resampling showed the best overall performance.
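The simplest of the four approaches, random oversampling, can be sketched as follows. This is a generic illustration rather than the authors' procedure: the function name `random_oversample`, the choice to match every score category to the largest one, and the toy responses are all assumptions.

```python
import random
from collections import defaultdict


def random_oversample(texts, scores, seed=0):
    """Duplicate responses from sparse score categories until all are equal in size."""
    rng = random.Random(seed)
    by_score = defaultdict(list)
    for text, score in zip(texts, scores):
        by_score[score].append(text)

    # Target size: the count of the most populated score category.
    target = max(len(items) for items in by_score.values())

    out_texts, out_scores = [], []
    for score, items in by_score.items():
        # Sample with replacement to fill the gap for this category.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_scores.append(score)
    return out_texts, out_scores


# Toy training set: score 2 is underrepresented.
texts = ["a", "b", "c", "d", "e"]
scores = [0, 0, 0, 1, 2]
balanced_texts, balanced_scores = random_oversample(texts, scores)
```

Oversampling should be applied to the training split only; duplicating responses before splitting would leak copies into the evaluation data.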
Algorithmic Bias in BERT for Response Accuracy Prediction: A Case Study for Investigating Population Validity
Gorgun, Guher; Yildirim‐Erbasli, Seyma N.
doi: 10.1111/jedm.12420; pmid: N/A
Pretrained large language models (LLMs) have gained popularity in recent years due to their high performance in various educational tasks such as learner modeling, automated scoring, automatic item generation, and prediction. Nevertheless, LLMs are black box approaches where models are less interpretable, and they may carry human biases and prejudices because historical human data have been used for pretraining these large‐scale models. For these reasons, the prediction tasks based on LLMs require scrutiny to ensure that the prediction models are fair and unbiased. In this study, we used BERT, a pretrained encoder‐only LLM, to predict response accuracy using action sequences extracted from the 2012 PIAAC assessment. We selected three countries (i.e., Finland, Slovakia, and the United States) representing different performance levels in the overall PIAAC assessment. We found promising results for predicting response accuracy using the fine‐tuned BERT model. Additionally, we examined algorithmic bias in the prediction models trained with different countries. We found differences in model performance, suggesting that some trained models are not free from bias, and thus the models are less generalizable across countries. Our results highlight the importance of investigating algorithmic fairness in prediction models to ensure that they are bias‐free.
Improving Ability Estimation Accuracy for Automated Item Generated Forms under Multistage Testing
Kim, Stella Y.; Lee, Won‐Chan
doi: 10.1111/jedm.70027; pmid: N/A
The emergence of automated item generation (AIG) techniques has intensified discussions around their application in assessment development. Some testing companies have already begun developing software to construct exams using AIG. However, the current literature offers limited insights into the characteristics of items generated through AIG, particularly in the realm of multistage testing (MST). This study proposes a novel approach for adjusting template item parameters to enhance ability estimation accuracy under the MST context. A simulation study was conducted using two MST designs with varying numbers of stages and modules. Results demonstrated that the proposed method significantly improved the accuracy of person parameter estimates compared to a more practical, yet less precise, approach that assumes all item clones share identical parameters.
Generalizability Theory for Randomly Parallel Testing
Lee, Won‐Chan; Kim, Stella Y.; Shin, Seungwon
doi: 10.1111/jedm.70029; pmid: N/A
Advancements in artificial intelligence (AI) have brought significant changes to testing practices, including the emergence of randomly parallel testing (RPT), in which examinees receive different but psychometrically similar sets of items generated from templates or AI‐based systems. This paper presents a generalizability theory (GT) framework for estimating conditional standard errors of measurement (CSEMs) and related reliability indices, with a particular focus on design structures commonly encountered in RPT within domain‐referenced testing contexts. The proposed framework supports the evaluation of score precision across a variety of operational designs, including crossed, nested, and multivariate configurations. Several illustrative examples are provided to demonstrate the methodology in practical settings. The paper also addresses key psychometric and interpretive challenges associated with RPT and outlines promising directions for future research.
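To make the notion of a conditional standard error of measurement concrete, the sketch below computes the conditional absolute SEM for one examinee in the simplest single-facet case, where each examinee receives a random sample of items from a domain and the score is the mean item score. The crossed, nested, and multivariate designs covered in the article are not shown, and the function name is an assumption.

```python
import math


def conditional_absolute_sem(item_scores):
    """Conditional absolute SEM of one examinee's mean item score.

    Uses the within-person variance of item scores divided by the number of
    items: sqrt( sum((x - mean)^2) / (n * (n - 1)) ).
    """
    n = len(item_scores)
    if n < 2:
        raise ValueError("at least two item scores are required")
    mean = sum(item_scores) / n
    sum_sq = sum((x - mean) ** 2 for x in item_scores)
    return math.sqrt(sum_sq / (n * (n - 1)))


# An examinee who answers half of four dichotomous items correctly.
csem = conditional_absolute_sem([1, 1, 0, 0])
```

For dichotomous items this reduces to the square root of p(1 - p)/(n - 1), where p is the examinee's proportion correct, so examinees near 0.5 have the largest conditional error.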
Linking Error on Achievement Levels Accounting for Dependencies and Complex Sampling
Jewsbury, Paul A.
doi: 10.1111/jedm.12439; pmid: N/A
Alternate assessments of the same construct or assessments that have undergone a change in the conditions of measurement are often linked in an attempt to establish score comparability. As the link must be estimated from the data, linking contributes error variance to estimators. We propose a novel method to account for linking variance in standard error estimation for achievement or proficiency levels, a primary outcome for many international, national, and U.S. state assessments. Achievement levels are proportions of a population within some range of ability, such as the proportion of the population classified as proficient or advanced. The method is validated in a simulation and with real data. Our method allows for sampling weights and complex sampling and involves an easily calculated correction term that may be added to conventional estimates of the error variance, correcting the conventional estimates for neglecting variance due to linking. Furthermore, the method accounts for dependencies between linking and other sources of variance, making it applicable to a much wider range of score comparisons than traditional methods of linking variance estimation.
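The quantity whose standard error is at issue, an achievement-level proportion estimated with sampling weights, can be sketched as below. This shows only the point estimate; the linking-variance correction term described in the abstract is not reproduced here. The function name, cut score, and data are hypothetical, and operational assessments typically work with plausible values rather than single scores.

```python
def weighted_proportion_at_or_above(scores, weights, cut_score):
    """Weighted share of the population scoring at or above a cut score."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weight_above = sum(w for s, w in zip(scores, weights) if s >= cut_score)
    return weight_above / total_weight


# Three sampled students; the third represents twice as many students.
proportion_proficient = weighted_proportion_at_or_above(
    scores=[200, 250, 300], weights=[1.0, 1.0, 2.0], cut_score=250
)
# proportion_proficient == 0.75
```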