Echoes of AI: Investigating the downstream effects of AI assistants on software maintainability
Borg, Markus; Hewett, Dave; Hagatulah, Nadim; Couderc, Noric; Söderberg, Emma; Graham, Donald; Kini, Uttam; Farley, Dave
doi: 10.1007/s10664-026-10889-1
pmid: N/A
Context: AI assistants, like GitHub Copilot and Cursor, are transforming software engineering. While several studies highlight productivity improvements, their impact on maintainability requires further investigation.
Objective: This study investigates whether co-development with AI assistants affects software maintainability, specifically how easily other developers can evolve the resulting source code.
Method: We conducted a two-phase, preregistered controlled experiment involving 151 participants, 95% of whom were professional developers. In Phase 1, participants added a new feature to a Java web application, with or without AI assistance. In Phase 2, a randomized controlled trial, new participants evolved these solutions without AI assistance.
Results: Phase 2 revealed no significant differences in subsequent evolution with respect to completion time or code quality. Bayesian analysis suggests that any speed or quality improvements from AI use were at most small and highly uncertain. Observational results from Phase 1 corroborate prior research: using an AI assistant yielded a 30.7% median reduction in completion time, and habitual AI users showed an estimated 55.9% speedup.
Conclusions: Overall, we did not detect systematic maintainability advantages or disadvantages when other developers evolved code co-developed with AI assistants. Within the scope of our tasks and measures, we observed no consistent warning signs of degraded code-level maintainability. Future work should examine risks such as code bloat from excessive code generation and cognitive debt as developers offload more mental effort to assistants.
On the outliers of file-structure evolution: a mining study of GitHub software repositories
Logemann, Matthijs; Rukmono, Satrio A.; Chaudron, Michel R. V.; Krüger, Jacob
doi: 10.1007/s10664-026-10891-7
pmid: N/A
Context: Modern software systems change continuously. Larger changes like architectural redesigns, feature additions, or system-wide refactorings regularly impact the file structures within a software repository (i.e., developers adding, deleting, or moving files).
Objective: While a normal evolutionary phenomenon, file-structure changes in software repositories have received little attention in past research. An important question that arises is whether outliers (i.e., changes increasing or decreasing file structures more strongly) are potential signs of quality issues in repository management and tooling.
Method: In this article, we contribute the first large-scale study on outliers of file-structure changes. To this end, we investigated more than 12.2 million file-structure changes from 94,247 GitHub repositories that span ten programming languages. We first performed a quantitative analysis of all these changes to establish a baseline regarding the depths of file structures and changes to these. Using this baseline, we identified and manually inspected 3,049 outliers.
Results: Our quantitative data shows that file-structure changes are pervasive in the evolution of software repositories, representing 17.9% of all commits across the studied projects. Via our manual inspection, we found that outliers are often associated with programming errors, initial setups, and misconfigurations of package managers. Thus, they are an indicator of potential mistakes and quality problems.
Conclusions: Our findings demonstrate that current practices and tools for managing software repositories should take file-structure changes into account. This could help practitioners monitor for and mitigate erroneous or unintended structural changes. Researchers can use our methodology and findings to design follow-up studies and new techniques for more robust repository management.
Machine learning, deep learning, or large language models: An empirical study on multi-label requirements classification
Wang, Wenhao; Peng, Jiaxi; Xiao, Hongbin; Hua, Yang; Zhou, Yufei; Li, Zhi; Wang, Xiaoli
doi: 10.1007/s10664-026-10879-3
pmid: N/A
Context: Automated requirements classification is crucial for software quality assurance, yet it remains a challenging task, particularly for multi-label classification (MLC) where requirements can span multiple quality attributes. While traditional machine learning (ML) and deep learning (DL) methods have been widely studied, the true potential of modern Large Language Models (LLMs) in this complex domain remains largely unexplored.
Objective: This paper presents the first comprehensive empirical study to systematically evaluate generative LLMs for multi-label requirements classification. We benchmark them against reproduced state-of-the-art traditional methods and dissect the impact of three core drivers of LLM performance: model architecture, Parameter-Efficient Fine-Tuning (PEFT), and prompting strategies.
Method: To ensure robust external validity, our evaluation leverages two distinct datasets: the widely-used, imbalanced EMSE benchmark and a newly constructed, balanced App Reviews Balanced Dataset (ARBD). We assess performance across various scenarios, including zero-shot, few-shot, and fine-tuned settings, using a suite of standard classification metrics.
Results: Our findings reveal critical, context-dependent insights. First, we establish that data distribution dictates the optimal paradigm: while Deep Learning (BERT) and Classical ML (SVM) define the “accuracy ceiling” in imbalanced and balanced supervised settings respectively, LLMs offer decisive advantages in high-recall scenarios (critical for requirements discovery) and are the only viable solution in low-data environments. Second, we demonstrate that architectural suitability trumps raw scale, as a moderately-sized open-source LLM (Qwen3-14B) consistently outperforms massive frontier models in F1-score. Finally, we show that PEFT (LoRA) serves a dual role dictated by data balance: it acts as a precision-recall trade-off lever in imbalanced scenarios, but as a robust performance amplifier in balanced ones.
Conclusion: LLM-based approaches represent a powerful and versatile new paradigm. Our findings provide a foundational guide and a clear decision framework for practitioners, highlighting how data characteristics (balance) and project goals (accuracy vs. coverage) determine the choice between Traditional Models and LLMs. This study paves the way for a more effective and context-aware application of LLMs in requirements engineering.
Exploring and improving knowledge distillation for pre-trained code models
Sun, Weifeng; Wu, Ruifeng; Li, Hongyan; Fu, Ying; Yu, Min; Yan, Meng
doi: 10.1007/s10664-026-10880-w
pmid: N/A
Recently, pre-trained code models (PCMs) have advanced software development by automating tasks and improving productivity. However, their large size hinders seamless adoption in developers’ daily workflows, as local execution is often infeasible and reliance on cloud-based services raises concerns about data privacy. These challenges highlight the need for effective compression techniques to enable secure and efficient on-device deployment. To address this problem, this paper investigates knowledge distillation (KD) as a means to compress large PCMs. We systematically evaluate the effectiveness of KD on PCMs by comparing different distillation paradigms across various code generation and understanding tasks. Our results show that feature-based distillation generally outperforms response-based approaches, though the effectiveness varies depending on the specific PCM and downstream task. Furthermore, we identify key factors that influence the effectiveness of feature-based knowledge distillation, including loss functions, mapping configurations, and the number of intermediate layer distillation iterations. Building on these insights, we propose BOKD, a Bayesian optimization-based method that adaptively selects optimal distillation configurations according to the student model’s capacity and task complexity. Empirical results demonstrate the practicality of BOKD. A compact 2-layer model (just 5% of teacher parameters) retains 96–97% (relative performance score) of teacher performance on classification tasks, while achieving a 30% perplexity reduction over baseline distillation methods. Our contributions include a comprehensive empirical study of distillation paradigms and a novel parameter optimization technique to enhance KD performance.
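The two distillation paradigms compared in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact loss functions or mapping configurations used in the paper, so the code below assumes the common textbook forms: a temperature-softened KL divergence for response-based distillation and a mean squared error on linearly projected hidden states for feature-based distillation. The function names and the `projection` matrix are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def response_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Response-based KD (assumed form): KL(teacher || student) on softened outputs.

    The T^2 factor is the usual scaling that keeps gradient magnitudes
    comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


def feature_kd_loss(student_hidden, teacher_hidden, projection):
    """Feature-based KD (assumed form): MSE between teacher hidden state and
    a linear projection of the student hidden state.

    `projection` is a hypothetical mapping matrix with one row per teacher
    dimension; it bridges the student and teacher hidden sizes.
    """
    projected = [sum(w * s for w, s in zip(row, student_hidden)) for row in projection]
    squared = [(a - b) ** 2 for a, b in zip(projected, teacher_hidden)]
    return sum(squared) / len(teacher_hidden)
```

A student that reproduces the teacher's logits incurs zero response-based loss, while the feature-based loss additionally constrains intermediate representations, which is one plausible reading of why the choice of mapping configuration matters.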
Do SDN configuration changes get reviewed differently? An empirical study at TELUS
Kansab, Samah; Aïdasso, Henri; Bordeleau, Francis; Tizghadam, Ali
doi: 10.1007/s10664-026-10897-1
pmid: N/A
Configuration files are crucial in Software-Defined Networking (SDN) as they define policies required for the dynamic and safe management of large-scale network traffic. Frequent changes to these files can indicate adaptability and responsiveness but may also suggest instability or frequent reconfiguration needs, due to the intricate dependencies between network components and the risk of misconfigurations that can affect network performance and security. The complexity introduced by these configuration changes poses significant challenges, complicating both the development and review processes. Managing and reviewing these changes effectively is essential to ensure that the network remains robust, secure, and optimized, making it crucial to understand and address the difficulties associated with configuration files in the SDN context. This paper presents the results of a study conducted in collaboration with our industrial partner, a telecom company providing SDN-based solutions, to investigate the specific challenges related to the review of configuration files compared to traditional development and documentation files. We analyze 8,495 GitLab Merge Requests (MRs) from five configuration-centric projects. Using both quantitative and qualitative methods, we compare configuration MRs with development and documentation MRs, and examine differences between configuration bug-related and non-bug-related MRs. Our findings show that configuration-dominant MRs (more than 50% of the files are configuration) receive significantly less review participation and activity than development MRs, with less consistent quality and more questions raised. However, configuration-inclusive MRs (less than 50% configuration files) show better review quality and alignment with development standards, featuring more suggestions and evaluations. 
When compared to documentation MRs, both configuration-dominant and configuration-inclusive MRs exhibit higher review activity and engagement, take longer to review, and are more frequently self-managed. Configuration-dominant bug-related MRs demonstrate significantly higher review activity, engagement, and approval, emphasizing technical accuracy and achieving high-quality ratings. Configuration-inclusive bug-related MRs show consistent review quality similar to non-bug-related MRs. The main contributions of the paper are: a comprehensive quantitative and qualitative comparison of reviews on configuration and traditional files, and an assessment of the impact of bugs in configuration files on the review process. These contributions give practitioners valuable insights to identify inefficiencies that reduce participation and effectiveness in configuration reviews in the SDN context, while highlighting areas for improving review practices.
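The two MR categories defined in this abstract (more than 50% configuration files versus less than 50%) can be sketched as a small classifier. This is only an illustration: the abstract does not say how a file is recognized as configuration or how an MR at exactly 50% is handled, so the extension list and the tie-breaking rule below are assumptions, as are all names in the code.

```python
from enum import Enum


class MRKind(Enum):
    CONFIG_DOMINANT = "configuration-dominant"
    CONFIG_INCLUSIVE = "configuration-inclusive"
    NO_CONFIG = "no configuration files"


# Hypothetical: file extensions treated as configuration in this sketch.
CONFIG_EXTENSIONS = (".yaml", ".yml", ".json", ".xml", ".conf")


def is_config_file(path):
    """Assumed heuristic: decide by file extension only."""
    return path.lower().endswith(CONFIG_EXTENSIONS)


def classify_mr(changed_files):
    """Classify a merge request by its share of configuration files.

    More than 50% configuration files -> configuration-dominant.
    Otherwise, with at least one configuration file -> configuration-inclusive.
    The exactly-50% case is not covered by the abstract; this sketch
    assigns it to configuration-inclusive as an assumption.
    """
    config_count = sum(is_config_file(p) for p in changed_files)
    if config_count == 0:
        return MRKind.NO_CONFIG
    share = config_count / len(changed_files)
    return MRKind.CONFIG_DOMINANT if share > 0.5 else MRKind.CONFIG_INCLUSIVE


print(classify_mr(["routes.yaml", "policy.yml", "main.py"]).value)
```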
CMF-Vul: Advancing automated vulnerability detection via contrastive multimodal fusion and challenge-driven representation learning
Li, Quanfeng; Jiang, Guiyuan; He, Peilan; Liu, Wenwen; Dong, Junyu; Sun, Yidan; Wu, Jiahui
doi: 10.1007/s10664-026-10886-4
pmid: N/A
Automated vulnerability detection remains challenging due to heterogeneous vulnerability cues and the diversity of source code representations. This paper presents CMF-Vul, a contrastive multimodal fusion framework that systematically integrates token-level semantics, program dependence graphs (PDGs), and rendered PDG images for function-level vulnerability detection. CMF-Vul performs cross-modal alignment and employs an instance-wise gated fusion module to adaptively weight modalities for each function, mitigating the variability of evidence across vulnerability patterns. To address the extreme sparsity of vulnerability signals and the high similarity between vulnerable and benign code, we propose Challenge-Driven Representation Learning (CDRL): (i) semantic-preserving positive generation via program transformations and (ii) multimodal hard-negative mining with adaptive contrastive weighting to emphasize boundary-adjacent confusable negatives in a unified embedding space. Extensive experiments on three public benchmarks (FFmpeg&Qemu, Big-Vul, and SARD) demonstrate that CMF-Vul consistently outperforms representative state-of-the-art baselines, and achieves a better precision-recall balance, particularly under noisy and imbalanced settings. Ablation studies further validate the effectiveness of both the proposed multimodal fusion and contrastive optimization components. Our implementation and scripts are publicly available.
Open source software development tool installation
Salerno, Larissa; Treude, Christoph; Thongtanunam, Patanamon
doi: 10.1007/s10664-026-10885-5
pmid: N/A
As the world of technology advances, so do the tools that software developers use to create new programs. In recent years, software development tools have become more popular, allowing developers to work more efficiently and produce higher-quality software. Still, installing such tools can be challenging for novice developers at the early stage of their careers, as they may face issues such as compatibility problems (e.g., with operating systems) and unclear instructions. Therefore, this work aims to investigate the challenges novice developers face when installing software development tools and the strategies they employ to overcome them. To investigate these, we conducted an analysis of 24 live software installation sessions to observe the difficulties developers encounter, the strategies they apply, and the types of information sources they consult when facing obstacles. We also conducted a validation survey with 144 students to support and expand our findings. Our results reveal recurring challenges such as unclear or dysfunctional documentation, complicated installation processes, version incompatibility, and lack of feedback during installation. To address these, participants used strategies such as reformulating search queries, reading instructions more carefully, and searching for alternative sources of documentation. These sources included community platforms (e.g., Stack Overflow), video tutorials, blog posts, and official documentation. Based on these findings, we provide practical recommendations for tool vendors, tool users, and researchers to improve the installation experience for novice developers.
Line-level bug-finding power of static analysis rules: a case study of Teamscale
Ye, Liwei; Nie, Yuge; Zhou, Yufei; Yang, Yibiao; Lu, Hongmin; Qian, Junyan; Zhou, Yuming
doi: 10.1007/s10664-026-10895-3
pmid: N/A
Teamscale is a commercial static analysis tool (SAT) that analyzes source code without executing it to identify potential quality problems. In Teamscale, static analysis rules are referred to as checks, and each violation generates a report referred to as a finding. Its 232 default Java checks can produce large numbers of findings in large projects, making review prioritization necessary under limited inspection budgets. Although Teamscale assigns severity levels to its checks—red (high) and yellow (low)—their line-level bug-finding power has not been systematically investigated. This study evaluates the line-level bug-finding power of Teamscale checks through the findings they report, derives an empirically grounded check ranking, examines whether red-severity checks exhibit stronger bug-finding power, and investigates whether ranked check orderings help practitioners detect more bugs under the same fixed inspection budget. We conducted an empirical study on 17 Apache Software Foundation (ASF) projects comprising 134 releases. To rank Teamscale checks by their line-level bug-finding power, we considered three types of check-ranking methods. We then compared these methods under a unified evaluation setting to identify the most reliable one in our experimental setting and used it to derive the final check ranking. Our analysis shows that checks related to code size and structural complexity demonstrate the strongest bug-finding power. We also find that red-severity checks generally exhibit stronger bug-finding power than yellow-severity checks, indicating that severity provides a coarse yet useful signal for inspection prioritization. More importantly, when findings are prioritized according to the ranked check ordering, practitioners can detect more buggy lines under the same fixed inspection budget. 
Our results suggest that prioritizing Teamscale findings according to the empirically derived check ranking can improve bug-finding efficiency under limited inspection budgets. The resulting global check ranking can serve as a reasonable default in practice. For projects where this global ranking aligns less well with project-specific bug patterns, project-specific recalibration may still be beneficial when sufficient local history is available.
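The idea of inspecting findings in ranked check order under a fixed budget can be sketched as follows. This is an illustration only: the check names, the ranking, the representation of a finding as a `(check, location)` pair, and the budget counted in distinct inspected locations are all assumptions, not Teamscale identifiers or the study's actual evaluation procedure.

```python
def prioritize(findings, check_rank):
    """Order findings by check rank (lower rank = stronger bug-finding power).

    Findings from checks missing in `check_rank` are placed last.
    Python's sort is stable, so ties keep their original order.
    """
    worst = len(check_rank)
    return sorted(findings, key=lambda finding: check_rank.get(finding[0], worst))


def buggy_lines_found(ordered_findings, buggy_lines, budget):
    """Count buggy locations reached when at most `budget` distinct
    locations are inspected, following the given finding order."""
    inspected = set()
    for _check, location in ordered_findings:
        if len(inspected) >= budget:
            break
        inspected.add(location)
    return len(inspected & buggy_lines)


# Hypothetical findings: (check name, (file, line)).
findings = [
    ("naming", ("A.java", 3)),
    ("method-length", ("A.java", 10)),
    ("nesting-depth", ("B.java", 7)),
]
# Hypothetical ranking that puts size and complexity checks first.
check_rank = {"method-length": 0, "nesting-depth": 1, "naming": 2}
buggy = {("A.java", 10), ("B.java", 7)}

unranked_hits = buggy_lines_found(findings, buggy, budget=2)
ranked_hits = buggy_lines_found(prioritize(findings, check_rank), buggy, budget=2)
print(unranked_hits, ranked_hits)
```

In this toy data, the same budget of two inspected locations reaches one buggy line in the original order and two in the ranked order.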
CDBench: Benchmarking the mutation testing capabilities of LLMs with code defenders
Romazanov, Artur; Fraser, Gordon; Herbold, Steffen
doi: 10.1007/s10664-026-10901-8
pmid: N/A
Most traditional benchmarks for evaluating Large Language Models (LLMs) in software development suffer from a narrow focus, high risks of data contamination, and static difficulty levels that fail to keep pace with rapid model evolution. To address these limitations, we introduce CDBench, a novel zero-sum benchmark based on the Code Defenders mutation testing game. By pitting models against each other in a competitive environment—where “attackers” introduce code mutations and “defenders” create tests to detect them—CDBench establishes a dynamic difficulty curve that scales naturally without human intervention. Our experiments reveal that while LLMs can generate diverse mutations, they often struggle with code validity; nevertheless, the framework effectively distinguishes model capabilities, highlighting the superior test generation of models like Gemini 2.5 Pro while exposing the instruction-following limitations of reasoning models. These findings demonstrate that zero-sum games offer a viable, contamination-resistant solution to the stagnation of current evaluation methodologies.
Is this build failure related to my patch? An empirical study of unrelated build failures in continuous integration
Huang, Yonghui Andie; da Costa, Daniel Alencar; Dick, Grant; El Mezouar, Mariam; Xiao, Liwen
doi: 10.1007/s10664-026-10874-8
pmid: N/A
In a hectic Continuous Integration (CI) environment, where several builds are triggered concurrently, legitimate build failures (e.g., not caused by flaky tests) may not always be related to the current push. These unrelated build failures can burden developers as they devote hours to attest whether errors are truly associated with their present changes. In this paper, we extract 77,354 CI build failures from 7 open source projects to understand and identify unrelated build failures. We attempt to provide an indication for developers about whether a build failure is likely to be related to the current push or not. Our results reveal that developers likely invest a median of 4 hours to determine whether a build failure is (un)related to their pushes. We perform a document analysis on a sample of 371 unrelated build failures (based on the 95% confidence level and 5% confidence interval from 10,316 potentially unrelated failures) to understand why build failures are deemed unrelated by developers. The themes generated from our document analysis reveal that unrelated test failures represent 20% of the cases in which build failures are deemed unrelated by developers. To predict whether a build failure is unrelated to the current push, we extract 33 features from issue reports, issue comments, and from the commits pertaining to the triggering push. 
We build semi-supervised PU-learning models over seven Apache projects and achieve precision ranging from $0.70 \pm 0.01$ to $0.88 \pm 0.02$, recall ranging from $0.30 \pm 0.03$ to $1.00 \pm 0.00$, and F1-scores ranging from $0.44 \pm 0.03$ to $0.91 \pm 0.00$, while the area under the ROC curve (AUC) spans $0.63 \pm 0.02$ to $0.97 \pm 0.03$. Our analysis of feature importance reveals that (i) the time taken from a submitted patch to the build-triggering push (CI latency), (ii) build failures sharing similar error messages with recent failures, and (iii) the number of comments preceding the build failure, are all efficient indicators for identifying potential unrelated build failures. The semi-supervised approach proposed in this work can help developers identify build failures that are unrelated to their current push, providing actionable guidance such as re-running builds, inspecting infrastructure logs, or prioritizing code-level debugging based on prediction outcomes.
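The sample of 371 failures drawn from 10,316 candidates in this abstract is consistent with a standard sample-size calculation. The abstract does not state which formula was used; the sketch below assumes Cochran's formula with a finite population correction, a z-score of 1.96 for the 95% confidence level, a 5% margin of error, and the conservative proportion p = 0.5. The function name is hypothetical.

```python
import math


def sample_size(population, z=1.96, margin=0.05, proportion=0.5):
    """Cochran's sample size with finite population correction (assumed formula).

    n0 = z^2 * p * (1 - p) / e^2        (infinite-population sample size)
    n  = n0 / (1 + (n0 - 1) / N)        (corrected for a population of size N)
    The result is rounded up to the next whole item.
    """
    n0 = (z ** 2) * proportion * (1 - proportion) / margin ** 2
    corrected = n0 / (1 + (n0 - 1) / population)
    return math.ceil(corrected)


print(sample_size(10316))  # prints 371
```

Under these assumptions the uncorrected size is about 384.2, and the correction for a population of 10,316 brings it to about 370.4, which rounds up to the reported 371.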