Empirical Software Engineering


journal article

LitStream Collection

Opportunities and security risks of technical leverage: A replication study on the NPM ecosystem

Samaana, Haya; Costa, Diego Elias; Abdellatif, Ahmad; Shihab, Emad

2025 Empirical Software Engineering

doi: 10.1007/s10664-025-10648-8; pmid: N/A

To comply with high productivity demands, software developers reuse free open-source software (FOSS) code to avoid reinventing the wheel when incorporating software features. The reliance on FOSS reuse has been shown to improve productivity and the quality of delivered software; however, reusing FOSS comes at the risk of exposing software projects to public vulnerabilities. Massacci and Pashchenko have explored this trade-off in the Java ecosystem through the lens of technical leverage: the ratio of code borrowed from FOSS over the code developed by project maintainers. In this paper, we replicate the work of Massacci and Pashchenko and we expand the analysis to include level-1 transitive dependencies to study technical leverage in the fastest-growing NPM ecosystem. We investigated 14,042 NPM library releases and found that both opportunities and risks of technical leverage are magnified in the NPM ecosystem. Small-medium libraries leverage 2.5x more code from FOSS than their code, while large libraries leverage only 3% of FOSS code in their projects. Our models indicate that technical leverage shortens the release cycle for small-medium libraries. However, the risk of vulnerability exposure is 4-7x higher for libraries with high technical leverage. We also expanded our replication study to include the first level of transitive dependencies, and show that the results still hold, albeit with significant changes in the magnitude of both opportunities and risks of technical leverage. Our results indicate the extremes of opportunities and risks in NPM, where high technical leverage enables fast releases but comes at the cost of security risks.
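The abstract defines technical leverage as the ratio of code borrowed from FOSS to the code developed by the project's maintainers. A minimal sketch of that ratio follows, assuming size is measured in lines of code; the `Release` type, its field names, and the two sample libraries are hypothetical and their numbers are invented for illustration, not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """Hypothetical record for one library release (illustrative only)."""
    name: str
    own_loc: int         # lines written by the project's maintainers
    dependency_loc: int  # lines pulled in from FOSS dependencies


def technical_leverage(release: Release) -> float:
    """Ratio of borrowed FOSS code to the project's own code."""
    if release.own_loc <= 0:
        raise ValueError("own_loc must be positive")
    return release.dependency_loc / release.own_loc


# Invented sizes: a small library that borrows heavily and a large one
# that borrows little relative to its own code base.
small = Release("small-lib", own_loc=2_000, dependency_loc=5_000)
large = Release("large-lib", own_loc=200_000, dependency_loc=6_000)

for release in (small, large):
    print(f"{release.name}: leverage = {technical_leverage(release):.2f}")
```

A leverage above 1 means the release ships more borrowed code than own code; whether transitive dependencies are counted in `dependency_loc` is a separate modelling choice that this sketch leaves open.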

journal article

LitStream Collection

An entropy-based measure of fork diversity and its correlations with open source software projects’ received contributions

Wu, Xiangchen; Wang, Liang; Zheng, Zhiwen; Sang, Baihui; Zhang, Jierui; Tao, Xianping

2025 Empirical Software Engineering

doi: 10.1007/s10664-025-10668-4; pmid: N/A

The fork-and-pull-based method is an important way for open-source software (OSS) projects to receive contributions. In this study, we introduce a novel metric called fork entropy, inspired by biodiversity, to measure the diversity of OSS projects’ forks beyond their simple counts. Based on Rao’s quadratic entropy, the metric measures the diversity of forks in changing project files. We validate the proposed metric through empirical studies on 102 OSS projects from the GitHub and GitLab platforms. The results show significant correlations between the fork entropy of a project and its contributions received with respect to external productivity, acceptance rate of external pull requests, and number of reported bugs. Our findings also reveal significant interactions between fork entropy and other factors, such as the number of forks. Furthermore, the time-shift correlation suggests that the historical impact of the fork entropy, along with other control variables, remains effective for up to twenty months. Based on these insights, we propose to predict a project’s received contributions using fork entropy and other control variables with both a classic linear ARMAX model (Autoregressive Moving Average with Exogenous Variables) and a deep, Transformer-based prediction model. Compared to making predictions using only current data, the models show improved performance in terms of higher prediction accuracy and faster convergence by including historical data. In summary, this work presents a comprehensive study on the correlations and temporal dependencies between the diversity of an OSS project’s forks, measured by the proposed fork entropy, and its received contributions. These findings provide insights for project maintainers and contributors to comprehend and coordinate their forking practices.
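Rao's quadratic entropy is the expected dissimilarity between two items drawn at random from a population: Q = sum over i, j of d(i, j) * p_i * p_j. The sketch below applies it to forks, assuming each fork is described by the set of files it changes, that dissimilarity is the Jaccard distance between those sets, and that forks are weighted uniformly unless weights are given. These three choices are illustrative assumptions; the abstract does not state how the paper defines them.

```python
from itertools import product


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """Dissimilarity of two file sets: 0.0 if identical, 1.0 if disjoint."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def rao_quadratic_entropy(forks: list[set[str]], weights=None) -> float:
    """Q = sum_{i,j} d(i, j) * p_i * p_j over all ordered pairs of forks."""
    n = len(forks)
    if n == 0:
        return 0.0
    if weights is None:
        weights = [1.0] * n  # assumption: every fork counts equally
    total = sum(weights)
    p = [w / total for w in weights]
    return sum(
        p[i] * p[j] * jaccard_distance(forks[i], forks[j])
        for i, j in product(range(n), repeat=2)
    )


# Hypothetical forks: two touch the same files, one touches other files.
same = [{"core/io.py", "core/net.py"}, {"core/io.py", "core/net.py"}]
mixed = same + [{"docs/index.md"}]

print(rao_quadratic_entropy(same))   # identical forks carry no diversity
print(rao_quadratic_entropy(mixed))  # a fork in another area raises Q
```

Unlike a raw fork count, this quantity stays at zero when every fork modifies the same files and grows only when forks spread across different parts of the project.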

journal article

Open Access Collection

On the effects of program slicing for vulnerability detection during code inspection

Papotti, Aurora; Tuma, Katja; Massacci, Fabio

2025 Empirical Software Engineering

doi: 10.1007/s10664-025-10636-y; pmid: 40196710

Slicing is a fault localization technique that has been proposed to support debugging and program comprehension. Yet, its empirical effectiveness during code inspection by humans has received limited attention. The goal of our study is two-fold. First, we aim to define what it means for a code reviewer to identify the vulnerable lines correctly. Second, we investigate whether reducing the number of to-be-inspected lines by method-level slicing supports code reviewers in detecting security vulnerabilities. We propose a novel approach based on the notion of a $\delta$-neighborhood (intuitively based on the idea of the context size of the command git diff) to define correctly identified lines. Then, we conducted a multi-year controlled experiment (2017-2023) in which MSc students attending security courses ($n=236$) were tasked with identifying vulnerable lines in original or sliced Java files from Apache Tomcat. We provide perfect seed lines for a slicing algorithm to control for confounding factors. Each treatment differs in the pair (Vulnerability, Original/Sliced) with a balanced design with vulnerabilities from the OWASP Top 10 2017: A1 (Injection), A5 (Broken Access Control), A6 (Security Misconfiguration), and A7 (Cross-Site Scripting). To generate smaller slices for human consumption, we used a variant of intra-procedural thin slicing. 
We report the results for $\delta = 0$, which corresponds to exactly matching the vulnerable ground truth lines, and $\delta = 3$, which represents the scenario of identifying the vulnerable area. For both cases, we found that slicing helps in ‘finding something’ (the participant has found at least some vulnerable lines) as opposed to ‘finding nothing’. For the case of $\delta = 0$, analyzing a slice and analyzing the original file are statistically equivalent from the perspective of lines found by those who found something. With $\delta = 3$, slicing helps to find more vulnerabilities compared to analyzing an original file, as we would normally expect. Given the type of population, additional experiments are necessary to generalize the results to experienced developers.
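One way to read the neighborhood idea is that a line reported by a reviewer counts as correct when it lies within a fixed number of lines of a ground-truth vulnerable line, much like the context lines around a hunk in git diff. The sketch below encodes that reading; the function name, the choice to score reported lines rather than ground-truth lines, and the sample line numbers are assumptions for illustration and may differ from the paper's exact definition.

```python
def correctly_identified(
    reported: set[int], ground_truth: set[int], delta: int
) -> set[int]:
    """Reported lines within `delta` lines of any ground-truth line."""
    if delta < 0:
        raise ValueError("delta must be non-negative")
    return {
        line
        for line in reported
        if any(abs(line - truth) <= delta for truth in ground_truth)
    }


# Hypothetical inspection result: vulnerable lines 40-42, reviewer marks
# one exact line, one nearby line, and one unrelated line.
ground_truth = {40, 41, 42}
reported = {38, 42, 90}

exact = correctly_identified(reported, ground_truth, delta=0)
area = correctly_identified(reported, ground_truth, delta=3)
print(sorted(exact))  # only the exact match counts
print(sorted(area))   # the nearby line now counts as well
```

With a neighborhood of zero the check reduces to set intersection, and widening it rewards reviewers who located the vulnerable area without pinpointing every line.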

journal article

LitStream Collection

Predicting the understandability of computational notebooks through code metrics analysis

Ghahfarokhi, Mojtaba Mostafavi; Asadi, Alireza; Asgari, Arash; Mohammadi, Bardia; Heydarnoori, Abbas

2025 Empirical Software Engineering

doi: 10.1007/s10664-025-10651-z; pmid: N/A

Computational notebooks have become the primary coding environment for data scientists. Despite their popularity, research on the code quality of these notebooks is still in its infancy, and the code shared in these notebooks is often of poor quality. Considering the importance of maintenance and reusability, it is crucial to pay attention to the understandability of the notebook code and identify the notebook metrics that play a significant role in its understandability. The level of code understandability is a qualitative variable closely associated with the user’s opinion about the code. Traditional approaches to measuring it either use limited questionnaires to review a few code pieces or rely on metadata such as likes and votes in software repositories. In our approach, we enhanced the measurement of the understandability level of Jupyter notebooks by leveraging user opinions related to code understandability within a software repository. As a case study, we started with 542,051 Kaggle Jupyter notebooks, compiled in a dataset named DistilKaggle, which we introduced in our previous research. To identify user comments associated with code understandability, we utilized a fine-tuned DistilBERT transformer. We established a user-opinion-based criterion for measuring code understandability by considering the number of code understandability-related comments, the upvotes on those comments, and the total views the notebook received. We refer to this criterion as User Opinion Code Understandability (UOCU), which has been proven to be much more effective than previous approaches. A hybrid approach combining UOCU with total upvotes further improved this criterion. Additionally, we trained machine learning models to classify notebook understandability solely based on notebook metrics. We collected 34 metrics for 132,723 final notebooks using the hybrid approach criterion. 
Our predictive model, built using a Random Forest classifier, achieved 89% accuracy in classifying code understandability levels in computational notebooks.

journal article

Open Access Collection

A comparative study on reward models for user interface adaptation with reinforcement learning

Gaspar-Figueiredo, Daniel; Fernández-Diego, Marta; Abrahão, Silvia; Insfran, Emilio

2025 Empirical Software Engineering

doi: 10.1007/s10664-025-10659-5; pmid: N/A

Context: Adapting the User Interface (UI) of software systems to users’ requirements and their context of use is a challenging task. It involves determining the right adaptation, at the right time and place, to make it valuable for end-users. We believe that recent progress in Machine Learning (ML) techniques could provide useful ways in which to support adaptation more effectively. In particular, Reinforcement Learning (RL) has proven to be effective in planning a sequence of UI adaptations over a long time horizon. However, RL requires either manually specifying a reward function or learning a reward model. Currently there is no empirical evidence supporting the usefulness of reward models for UI adaptation. Objective: This paper presents a confirmatory empirical study aimed at investigating the effectiveness of two different approaches to generating reward models in the context of UI adaptation using reinforcement learning: (1) a reward model derived exclusively from predictive Human-Computer Interaction (HCI) models (AUI-HCI), and (2) a reward model derived from predictive HCI models augmented by human feedback (AUI-HCI-HF), compared to non-adaptive (NA) interfaces. Method: A controlled experiment with an AB/BA crossover design was conducted to evaluate the impact of these reward models on user experience, measured through objective and subjective engagement, as well as user satisfaction. Our study contributes to the understanding of how reward modeling can facilitate UI adaptation through RL. Results: The results showed a significant improvement in objective engagement for AUI-HCI-HF compared to non-adaptive interfaces. However, no significant differences were found between AUI-HCI and non-adaptive interfaces for any of the other measurements, across any conditions. Conclusion: Integrating human feedback into RL reward models enhances objective engagement, but its impact on subjective engagement and user satisfaction remains limited. 
While AUI-HCI-HF shows promise for improving interaction metrics, further research is needed to better align reward models with broader user perceptions and preferences, particularly compared to non-adaptive interfaces.

journal article

Open Access Collection

Quantum circuit mutants: Empirical analysis and recommendations

Mendiluze Usandizaga, Eñaut; Ali, Shaukat; Yue, Tao; Arcaini, Paolo

2025 Empirical Software Engineering

doi: 10.1007/s10664-025-10643-z; pmid: N/A

As a new research area, quantum software testing lacks systematic testing benchmarks to assess testing techniques’ effectiveness. Recently, some open-source benchmarks and mutation analysis tools have emerged. However, there is insufficient evidence on how various quantum circuit characteristics (e.g., circuit depth, number of quantum gates), algorithms (e.g., Quantum Approximate Optimization Algorithm), and mutation characteristics (e.g., mutation operators) affect the detection of mutants in quantum circuits. Studying such relations is important to systematically design faulty benchmarks with varied attributes (e.g., the difficulty in detecting a seeded fault) to facilitate assessing the cost-effectiveness of quantum software testing techniques efficiently. To this end, we present a large-scale empirical evaluation with more than 700K faulty benchmarks (quantum circuits) generated by mutating 382 real-world quantum circuits. Based on the results, we provide valuable insights for researchers to define systematic quantum mutation analysis techniques. We also provide a tool to recommend mutants to users based on chosen characteristics (e.g., a quantum algorithm type) and the required difficulty of detecting mutants. Finally, we also provide faulty benchmarks that can already be used to assess the cost-effectiveness of quantum software testing techniques.
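To make the idea of a quantum circuit mutant concrete, the sketch below models a circuit as an ordered list of gates and derives faulty variants from it. The three operators shown (remove, replace, and add a gate), the tuple-based circuit representation, and the gate names are assumptions chosen for illustration; the abstract mentions mutation operators without listing them, and no quantum SDK is used here.

```python
# A gate is (name, qubits); a circuit is an ordered list of gates.
Gate = tuple[str, tuple[int, ...]]


def _check(circuit: list[Gate], position: int, upper: int) -> None:
    if not 0 <= position < upper:
        raise IndexError(f"position {position} out of range for circuit")


def remove_gate(circuit: list[Gate], position: int) -> list[Gate]:
    """Mutant with the gate at `position` dropped."""
    _check(circuit, position, len(circuit))
    return circuit[:position] + circuit[position + 1:]


def replace_gate(circuit: list[Gate], position: int, name: str) -> list[Gate]:
    """Mutant with the gate at `position` swapped for `name` on the same qubits."""
    _check(circuit, position, len(circuit))
    _, qubits = circuit[position]
    return circuit[:position] + [(name, qubits)] + circuit[position + 1:]


def add_gate(circuit: list[Gate], position: int, gate: Gate) -> list[Gate]:
    """Mutant with `gate` inserted before `position` (or appended at the end)."""
    _check(circuit, position, len(circuit) + 1)
    return circuit[:position] + [gate] + circuit[position:]


# Original circuit: Hadamard on qubit 0, then CNOT from qubit 0 to 1.
bell = [("h", (0,)), ("cx", (0, 1))]

mutants = [
    remove_gate(bell, 0),
    replace_gate(bell, 0, "x"),
    add_gate(bell, 2, ("z", (1,))),
]
for mutant in mutants:
    print(mutant)
```

Each operator returns a new list and leaves the original untouched, so one source circuit can seed many mutants that differ in position, gate type, and, as a consequence, in how hard the fault is to detect.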

journal article

LitStream Collection

Predicting long time contributors with knowledge units of programming languages: an empirical study

Ahasanuzzaman, Md; Oliva, Gustavo A.; Hassan, Ahmed E.

2025 Empirical Software Engineering

doi: 10.1007/s10664-025-10655-9; pmid: N/A

Long-time contributors (LTCs) are essential for the sustainability of open source software (OSS) projects, but unfortunately many developers leave early. Predicting potential LTCs early in their tenure allows project maintainers to effectively allocate resources and mentoring to enhance their development and retention. Prior studies show that developers are primarily motivated to join OSS projects by opportunities to learn and enhance their skills in different areas, including aspects of programming languages. This motivation plays a crucial role in their continued engagement and contributions to projects. Mapping programming language expertise to developers and characterizing projects in terms of how they use programming languages can help identify developers who are more likely to become LTCs. However, prior studies on predicting LTCs do not consider programming language skills. Towards filling this gap, this paper reports an empirical study on the usage of knowledge units (KUs) of the Java programming language to predict LTCs. A KU is a cohesive set of key capabilities that are offered by one or more building blocks of a given programming language. We select 75 real-world actively maintained Java projects from GitHub. Next, we build a prediction model called KULTC, which leverages KU-based features along five different dimensions. To engineer these features, we detect and analyze KUs from the studied 75 Java projects (spanning a total of 353K commits and 168K pull requests) as well as 4,219 other Java projects in which the studied developers previously worked (spanning a total of 1.7M commits). We compare the performance of KULTC with the state-of-the-art model, which we call BAOLTC. Even though KULTC focuses exclusively on the programming language perspective, KULTC achieves a median AUC of at least 0.75 and significantly outperforms BAOLTC. 
Combining the features of KULTC with the features of BAOLTC results in an enhanced model (KULTC+BAOLTC) that significantly outperforms BAOLTC across different settings with a normalized AUC improvement of 16.5%. Our feature importance analysis with SHAP reveals that developer expertise in the studied project is the most influential feature dimension for predicting LTCs. Finally, we develop a cost-effective model (KULTC_DEV_EXP+BAOLTC) that significantly outperforms BAOLTC. These encouraging results can be helpful to researchers who wish to further study developers’ engagement with and retention in OSS projects or build models for predicting LTCs. Future work in this area should thus (i) consider KULTC as a baseline model and (ii) consider KU-based features in the design of models that predict LTCs.

journal article

LitStream Collection

Leveraging encoder-only large language models for mobile app review feature extraction

Motger, Quim; Miaschi, Alessio; Dell’Orletta, Felice; Franch, Xavier; Marco, Jordi

2025 Empirical Software Engineering

doi: 10.1007/s10664-025-10660-y; pmid: N/A

Mobile app review analysis presents unique challenges due to the low quality, subjective bias, and noisy content of user-generated documents. Extracting features from these reviews is essential for tasks such as feature prioritization and sentiment analysis, but it remains a challenging task. Meanwhile, encoder-only models based on the Transformer architecture have shown promising results for classification and information extraction tasks for multiple software engineering processes. This study explores the hypothesis that encoder-only large language models can enhance feature extraction from mobile app reviews. By leveraging crowdsourced annotations from an industrial context, we redefine feature extraction as a supervised token classification task. Our approach includes extending the pre-training of these models with a large corpus of user reviews to improve contextual understanding and employing instance selection techniques to optimize model fine-tuning. Empirical evaluations demonstrate that these methods improve the precision and recall of extracted features and enhance performance efficiency. Key contributions include a novel approach to feature extraction, annotated datasets, extended pre-trained models, and an instance selection mechanism for cost-effective fine-tuning. This research provides practical methods and empirical evidence in applying large language models to natural language processing tasks within mobile app reviews, offering improved performance in feature extraction.
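Framing feature extraction as token classification means that a model assigns one label to every token of a review, and the feature mentions are then read off from the labelled sequence. The sketch below shows only that decoding step, assuming a BIO-style scheme with the labels `B-feature`, `I-feature`, and `O`; the label names, the function, and the sample review are hypothetical, and the labels are hand-written here rather than predicted by a model.

```python
def decode_features(tokens: list[str], labels: list[str]) -> list[str]:
    """Collect contiguous B-/I- labelled tokens into feature phrases."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    features: list[str] = []
    current: list[str] = []
    for token, label in zip(tokens, labels):
        if label == "B-feature":
            # A new feature starts; close any feature still open.
            if current:
                features.append(" ".join(current))
            current = [token]
        elif label == "I-feature" and current:
            current.append(token)
        else:
            # "O" (or a stray "I-feature" with nothing open) ends a feature.
            if current:
                features.append(" ".join(current))
            current = []
    if current:
        features.append(" ".join(current))
    return features


tokens = ["I", "love", "the", "dark", "mode", "and", "offline", "maps"]
labels = ["O", "O", "O", "B-feature", "I-feature", "O", "B-feature", "I-feature"]

print(decode_features(tokens, labels))
```

In a full pipeline the `labels` list would come from a fine-tuned encoder-only model; keeping decoding separate makes it straightforward to score predicted phrases against annotated ones for precision and recall.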

journal article

LitStream Collection

RAG-Driven multiple assertions generation with large language models

Liu, Zhuang; Wang, Hailong; Xu, Tongtong; Wang, Bei

2025 Empirical Software Engineering

doi: 10.1007/s10664-025-10641-1; pmid: N/A

Software testing is one of the most crucial parts of the software development life cycle. Developers spend a substantial amount of time and effort on software testing. Recently, there has been a growing scholarly interest in the automation of software testing. However, recent studies have revealed significant limitations in the quality and efficacy of the generated assert statements. These limitations primarily arise due to: (i) the inherent complexity involved in generating assert statements that are both meaningful and effective; (ii) the challenge of capturing the relationship between multiple assertions in a single test case. In recent research, deep learning techniques have been employed to generate meaningful assertions. However, it is typical for a single assertion to be generated for each test case, which contradicts the current situation where over 40% of test cases contain multiple assertions. Compared with deep learning techniques, the advantages of large language models (LLMs) in test generation tasks have been proven. This paper proposes a new approach named ALLMAssert (Augmented Large Language Model Assertion Generation) to automatically generate multiple assertions for test methods. ALLMAssert exploits two LLMs to collaboratively generate test assertions for developers. ALLMAssert first fine-tunes the codellama-34B-instruct model to obtain a specialized model for multi-assert generation. We then mine more contextual information in the Java project. Through a series of information augmentation steps, we prompt the base LLM to correct the assert statements generated by the fine-tuned LLM. To evaluate the effectiveness of our approach, we conduct extensive experiments on a dataset built on top of the Methods2Test dataset. Experimental results show that ALLMAssert achieves scores of 56.61%, 20.43%, and 15.07% in terms of CodeBLEU, accuracy, and perfect prediction, respectively, and substantially outperforms the baselines. 
Furthermore, we evaluate the effectiveness of ALLMAssert on the task of bug detection, and the result indicates that the assert sequences generated by ALLMAssert can assist in exposing 76 real-world bugs extracted from Defects4J, also outperforming the SOTA approaches by a large margin.

journal article

LitStream Collection

Understanding refactorings in Elixir functional language

da Matta Vegi, Lucas Francisco; Valente, Marco Tulio

2025 Empirical Software Engineering

doi: 10.1007/s10664-025-10652-y; pmid: N/A

Elixir is a modern functional language known for its robustness and scalability. This language has seen growing adoption by companies worldwide in the last 12 years. Despite this fact, and to the best of our knowledge, there are few works in the literature focused on studying refactoring strategies for code implemented with this language. In a previous, preliminary study, we conducted a systematic literature review to provide an initial list of refactorings for Elixir. Aiming to expand the results of this preliminary study, in this work we use a mixed methodology based on a broader systematic literature review, a grey literature review, and the mining of artifacts in GitHub repositories to prospect and document new refactorings for this language. As a result, we propose a comprehensive catalog of 82 refactorings, including 14 new ones specific to Elixir, 32 aimed at functional languages, 11 Erlang-specific transformations compatible with Elixir, and 25 traditional refactorings cataloged by Fowler, which are also compatible with Elixir code. We validated this catalog by surveying 144 experienced Elixir developers from 42 countries spanning all continents. In this survey, we assessed the levels of relevance and prevalence of each refactoring in the catalog. We show that 93% of the refactorings in Elixir are at least moderately relevant, suggesting they can improve the code quality of Elixir systems. Furthermore, 71% of these refactorings are frequently used in production code. Our results have practical implications related to the learning and use of refactorings in Elixir.
