ACM Transactions on Software Engineering and Methodology (TOSEM)

journal article

LitStream Collection

Security Weaknesses of Copilot-Generated Code in GitHub Projects: An Empirical Study

Fu, Yujia; Liang, Peng; Tahir, Amjed; Li, Zengyang; Shahin, Mojtaba; Yu, Jiaxin; Chen, Jinfu

2025 ACM Transactions on Software Engineering and Methodology (TOSEM)

Modern code generation tools utilizing AI models like Large Language Models have gained increased popularity due to their ability to produce functional code. However, their usage presents security challenges, often resulting in insecure code merging into the code base. Thus, evaluating the quality of generated code, especially its security, is crucial. While prior research explored various aspects of code generation, the focus on security has been limited, mostly examining code produced in controlled environments rather than open source development scenarios. To address this gap, we conducted an empirical study, analyzing code snippets generated by GitHub Copilot and two other AI code generation tools (i.e., CodeWhisperer and Codeium) from GitHub projects. Our analysis identified 733 snippets, revealing a high likelihood of security weaknesses, with 29.5% of Python and 24.2% of JavaScript snippets affected. These issues span 43 Common Weakness Enumeration (CWE) categories, including significant ones like CWE-330: Use of Insufficiently Random Values, CWE-94: Improper Control of Generation of Code, and CWE-79: Cross-site Scripting. Notably, eight of those CWEs are among the 2023 CWE Top-25, highlighting their severity. We further examined using Copilot Chat to fix security issues in Copilot-generated code by providing Copilot Chat with warning messages from the static analysis tools, and up to 55.5% of the security issues can be fixed. Finally, we provide suggestions for mitigating security issues in generated code.

journal article

Open Access Collection

VexIR2Vec: An Architecture-Neutral Embedding Framework for Binary Similarity

VenkataKeerthy, S.; Banerjee, Soumya; Dey, Sayan; Andaluri, Yashas; PS, Raghul; Kalyanasundaram, Subrahmanyam; Pereira, Fernando Magno Quintão; Upadrasta, Ramakrishna

2025 ACM Transactions on Software Engineering and Methodology (TOSEM)

doi: 10.1145/3721481

Binary similarity involves determining whether two binary programs exhibit similar functionality, with applications in vulnerability detection, malware analysis, and copyright detection. However, variations in compiler settings, target architectures, and deliberate code obfuscations significantly complicate the similarity measurement by effectively altering the syntax, semantics, and structure of the underlying binary. To address these challenges, we propose VexIR2Vec, a robust, architecture-neutral approach based on VEX-IR to solve binary similarity tasks. VexIR2Vec consists of three key components: a peephole extractor, a normalization engine (VexINE), and an embedding model (VexNet). The process to build program embeddings starts with the extraction of sequences of basic blocks, or peepholes, from control-flow graphs via random walks, capturing structural information. These generated peepholes are then normalized using VexINE, which applies compiler-inspired transformations to reduce architectural and compiler-induced variations. Embeddings of peepholes are generated using representation learning techniques, avoiding Out-of-Vocabulary (OOV) issues. These embeddings are then fine-tuned with VexNet, a feed-forward Siamese network that maps functions into a high-dimensional space for diffing and searching tasks in an application-independent manner. We evaluate VexIR2Vec against five baselines (BinDiff, DeepBinDiff, SAFE, BinFinder, and histograms of opcodes) on a dataset comprising 2.7M functions and 15.5K binaries from 7 projects compiled across 12 compilers targeting x86 and ARM architectures. The experiments span four adversarial settings (cross-optimization, cross-compilation, cross-architecture, and obfuscations) that are typically exploited by malware and vulnerabilities. In diffing experiments, VexIR2Vec outperforms the nearest baseline in these four scenarios by 40%, 18%, 21%, and 60%, respectively. In the searching experiment, VexIR2Vec achieves a mean average precision of 0.76, outperforming the nearest baseline by 46%. Our framework is highly scalable and is built as a lightweight, multi-threaded, parallel library using only open source tools. VexIR2Vec is approximately 3.1–3.5× faster than the closest baselines and orders of magnitude faster than other tools.

journal article

Open Access Collection

µOpTime: Statically Reducing the Execution Time of Microbenchmark Suites Using Stability Metrics

Japke, Nils; Grambow, Martin; Laaber, Christoph; Bermbach, David

2025 ACM Transactions on Software Engineering and Methodology (TOSEM)

doi: 10.1145/3715322

Performance regressions have a tremendous impact on the quality of software. One way to catch regressions before they reach production is executing performance tests before deployment, e.g., using microbenchmarks, which measure performance at subroutine level. In projects with many microbenchmarks, this may take several hours due to repeated execution to get accurate results, disqualifying them from frequent use in CI/CD pipelines. We propose µOpTime, a static approach to reduce the execution time of microbenchmark suites by configuring the number of repetitions for each microbenchmark. Based on the results of a full, previous microbenchmark suite run, µOpTime determines the minimal number of (measurement) repetitions with statistical stability metrics that still lead to accurate results. We evaluate µOpTime with an experimental study on 14 open-source projects written in two programming languages and five stability metrics. Our results show that (i) µOpTime reduces the total suite execution time (measurement phase) by up to 95.83% (Go) and 94.17% (Java), (ii) the choice of stability metric depends on the project and programming language, (iii) microbenchmark warmup phases have to be considered for Java projects (potentially leading to higher reductions), and (iv) µOpTime can be used to reliably detect performance regressions in CI/CD pipelines.

journal article

LitStream Collection

The Sustainability Face of Automated Program Repair Tools

2025 ACM Transactions on Software Engineering and Methodology (TOSEM)

doi: 10.1145/3744900

Automated program repair (APR) aims to automatize the process of repairing software bugs in order to reduce the cost of maintaining software programs. While APR accuracy has significantly improved in recent years, its energy impact remains unstudied. The field of green software research aims to measure the energy consumption required to develop, maintain, and use software products. Our main goal is to define the foundation for measuring the energy consumption of the APR activity. We state that an environmentally sustainable (or green) APR tool achieves a good balance between the ability to correctly repair bugs and the amount of energy consumed during such process. We measure the energy consumption of 10 traditional APR tools for Java and 11 fine-tuned large-language models (LLM) trying to repair real bugs from Defects4J. The results of this study show the existing tradeoff between energy consumption and repairability. In particular, APR tools such as TBar and RepairLlama repair more bugs than other approaches at the expense of a higher energy consumption. Other tools, such as SimFix and the LLM CodeT5-large, provide a good tradeoff between energy consumption and repairability. We also present guidelines consisting of a set of recommendations for developing greener APR.

journal article

LitStream Collection

Distinguishing GUI Component States for Blind Users Using Large Language Models

Zhang, Mengxi; Liu, Huaxiao; Du, Changhao; Wang, Tengmei; Li, Han; Huang, Pei; Chen, Chunyang

2025 ACM Transactions on Software Engineering and Methodology (TOSEM)

doi: 10.1145/3722106

Graphical User Interfaces (GUIs) serve as the primary medium for user interaction with mobile applications (apps). Within these GUIs, editable text views, buttons, and other visual elements exhibit different states following user actions. However, developers often present these states only in various colors without providing textual hints for blind users. This results in significant difficulties for blind users to discern the transitions in component states, thereby hindering their ability to proceed with subsequent actions. Traditional rule-based methods and attribute settings often struggle to adapt to diverse component styles and fail to address the component state changes influenced by context. Recently, pre-trained Large Language Models (LLMs) have demonstrated their generalization ability to various downstream tasks. In this work, we leverage LLMs and propose a tool called Component states distinguishing GPT (CasGPT) to automatically distinguish component states in GUIs and provide corresponding textual hints, thereby aiding blind users in app usage. Our experiments demonstrate that CasGPT is a lightweight approach capable of accurately distinguishing component states (accuracy = 86.5%). The usefulness of our method is validated through a user study, where participants expressed positive attitudes toward it. Also, we compare and find that our method outperforms other open source LLMs and different versions of GPT.

journal article

LitStream Collection

An Empirical Study of Code Simplification Methods in Code Intelligence Tasks

Shen, Zongwen; Li, Yuning; Ge, Jidong; Chen, Xiang; Li, Chuanyi; Huang, Liguo; Luo, Bin

2025 ACM Transactions on Software Engineering and Methodology (TOSEM)

doi: 10.1145/3720540

In recent years, pre-trained language models have seen significant success in natural language processing and have been increasingly applied to code-related tasks. Code intelligence tasks have shown promising performance with the support of code pre-trained language models. Pre-processing code simplification methods have been introduced to prune code tokens from the model’s input while maintaining task effectiveness. These methods improve the efficiency of code intelligence tasks while reducing computational costs. Post-prediction code simplification methods provide explanations for code intelligence task outcomes, enhancing the reliability and interpretability of model predictions. However, comprehensive evaluations of these methods across diverse code pre-trained model architectures and code intelligence tasks are lacking. To assess the effectiveness of code simplification methods, we conduct an empirical study integrating these code simplification methods with various pre-trained code models across multiple code intelligence tasks. Our empirical findings suggest that developing task-specific code simplification methods would be beneficial. Then, we recommend leveraging post-prediction methods to summarize prior knowledge, which can pre-process code simplification strategies. Moreover, establishing more evaluation mechanisms for code simplification is crucial. Finally, we propose incorporating code simplification methods into the pre-training phase of code pre-trained models to enhance their program comprehension and code representation capabilities.

journal article

Open Access Collection

Prompting Techniques for Secure Code Generation: A Systematic Investigation

Tony, Catherine; Díaz Ferreyra, Nicolás E.; Mutas, Markus; Dhif, Salem; Scandariato, Riccardo

2025 ACM Transactions on Software Engineering and Methodology (TOSEM)

doi: 10.1145/3722108

Large Language Models (LLMs) are gaining momentum in software development with prompt-driven programming enabling developers to create code from Natural Language (NL) instructions. However, studies have questioned their ability to produce secure code and, thereby, the quality of prompt-generated software. Alongside, various prompting techniques that carefully tailor prompts have emerged to elicit optimal responses from LLMs. Still, the interplay between such prompting strategies and secure code generation remains under-explored and calls for further investigations. Objective: In this study, we investigate the impact of different prompting techniques on the security of code generated from NL instructions by LLMs. Method: First, we perform a systematic literature review to identify the existing prompting techniques that can be used for code generation tasks. A subset of these techniques are evaluated on GPT-3, GPT-3.5, and GPT-4 models for secure code generation. For this, we used an existing dataset consisting of 150 NL security-relevant code generation prompts. Results: Our work (i) classifies potential prompting techniques for code generation (ii) adapts and evaluates a subset of the identified techniques for secure code generation tasks, and (iii) observes a reduction in security weaknesses across the tested LLMs, especially after using an existing technique called Recursive Criticism and Improvement (RCI), contributing valuable insights to the ongoing discourse on LLM-generated code security.

journal article

LitStream Collection

Unraveling Code Clone Dynamics in Deep Learning Frameworks

Assi, Maram; Hassan, Safwat; Zou, Ying

2025 ACM Transactions on Software Engineering and Methodology (TOSEM)

doi: 10.1145/3721125

Deep Learning (DL) frameworks play a critical role in advancing AI, and their rapid growth underscores the need for a comprehensive understanding of software quality and maintainability. DL frameworks, like other systems, are prone to code clones. Code clones refer to identical or highly similar source code fragments within the same project or even across different projects. Code cloning can have positive and negative implications for software development, influencing maintenance, readability, and bug propagation. While the existing studies focus on studying clones in DL-based applications, to our knowledge, no work has been done investigating clones, their evolution, and their impact on the maintenance of DL frameworks. In this article, we aim to address the knowledge gap concerning the evolutionary dimension of code clones in DL frameworks and the extent of code reuse across these frameworks. We empirically analyze code clones in nine popular DL frameworks, i.e., TensorFlow, Paddle, PyTorch, Aesara, Ray, MXNet, Keras, Jax, and BentoML, to investigate (1) the characteristics of the long-term code cloning evolution over releases in each framework, (2) the short-term, i.e., within-release, code cloning patterns and their influence on the long-term trends, and (3) the file-level code clones within the DL frameworks. Our findings reveal that DL frameworks adopt four distinct cloning trends: “Serpentine,” “Rise and Fall,” “Decreasing,” and “Stable” and that these trends present some common and distinct characteristics. For instance, bug-fixing activities persistently happen in clones irrespective of the clone evolutionary trend but occur more in the “Serpentine” trend. Moreover, the within-release level investigation demonstrates that short-term code cloning practices impact long-term cloning trends. The cross-framework code clone investigation reveals the presence of functional and architectural adaptation file-level cross-framework code clones across the nine studied frameworks. We provide insights that foster robust clone practices and collaborative maintenance in the development of DL frameworks.

journal article

Open Access Collection

Stress Testing Control Loops in Cyber-Physical Systems: RCR Report

Mandrioli, Claudio; Shin, Seung Yeob; Maggio, Martina; Bianculli, Domenico; Briand, Lionel

2025 ACM Transactions on Software Engineering and Methodology (TOSEM)

doi: 10.1145/3733715

This is the Replicated Computational Results (RCR) Report for the article ‘Stress Testing Control Loops in Cyber-Physical Systems’. The article proposes a novel approach for testing Cyber-Physical Systems (CPS) based on the integration of the guarantees that can be provided with the control theoretical models into the software testing practices. This RCR report describes how to reproduce the empirical results of the article. We make available the different scripts needed to fully replicate the results obtained in our article.

journal article

LitStream Collection

SCOPE: Performance Testing for Serverless Computing

Wen, Jinfeng; Chen, Zhenpeng; Zhao, Jianshu; Sarro, Federica; Ping, Haodi; Zhang, Ying; Wang, Shangguang; Liu, Xuanzhe

2025 ACM Transactions on Software Engineering and Methodology (TOSEM)

doi: 10.1145/3717609

Serverless computing is a popular cloud computing paradigm that has found widespread adoption across various online workloads. It allows software engineers to develop cloud applications as a set of functions (called serverless functions). However, accurately measuring the performance (i.e., end-to-end response latency) of serverless functions is challenging due to the highly dynamic nature of the environment in which they run. To tackle this problem, a potential solution is to apply checks of performance testing techniques to determine how many repetitions of a given serverless function across a range of inputs are needed to cater to the performance fluctuation. However, the available literature lacks performance testing approaches designed explicitly for serverless computing. In this article, we propose the first serverless computing-oriented performance testing (SCOPE) approach. SCOPE takes into account the unique performance characteristics of serverless functions, such as their short execution durations and on-demand triggering. As such, SCOPE is designed as a fine-grained analysis approach. SCOPE incorporates the accuracy check and the consistency check to obtain the accurate and reliable performance of serverless functions. The evaluation shows that SCOPE provides testing results with 97.25% accuracy, 33.83 percentage points higher than the best currently available technique. Moreover, the superiority of SCOPE over the state-of-the-art holds on all functions that we study.
