Security Weaknesses of Copilot-Generated Code in GitHub Projects: An Empirical Study
Fu, Yujia; Liang, Peng; Tahir, Amjed; Li, Zengyang; Shahin, Mojtaba; Yu, Jiaxin; Chen, Jinfu
doi: 10.1145/3716848
Modern code generation tools utilizing AI models like Large Language Models have gained increased popularity due to their ability to produce functional code. However, their usage presents security challenges, often resulting in insecure code merging into the code base. Thus, evaluating the quality of generated code, especially its security, is crucial. While prior research explored various aspects of code generation, the focus on security has been limited, mostly examining code produced in controlled environments rather than open source development scenarios. To address this gap, we conducted an empirical study, analyzing code snippets generated by GitHub Copilot and two other AI code generation tools (i.e., CodeWhisperer and Codeium) from GitHub projects. Our analysis identified 733 snippets, revealing a high likelihood of security weaknesses, with 29.5% of Python and 24.2% of JavaScript snippets affected. These issues span 43 Common Weakness Enumeration (CWE) categories, including significant ones like CWE-330: Use of Insufficiently Random Values, CWE-94: Improper Control of Generation of Code, and CWE-79: Cross-site Scripting. Notably, eight of those CWEs are among the 2023 CWE Top-25, highlighting their severity. We further examined using Copilot Chat to fix security issues in Copilot-generated code by providing it with warning messages from the static analysis tools, and found that up to 55.5% of the security issues can be fixed. Finally, we provide suggestions for mitigating security issues in generated code.
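As a hedged illustration of CWE-330 (not a snippet taken from the study's dataset), the Python sketch below contrasts a token built with the general-purpose `random` module against one built with the standard-library `secrets` module. The function names and the password-reset scenario are hypothetical.

```python
import random
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def weak_reset_token(length: int = 32) -> str:
    """CWE-330 pattern: `random` is a deterministic PRNG not meant for security use."""
    return "".join(random.choice(ALPHABET) for _ in range(length))


def strong_reset_token(num_bytes: int = 32) -> str:
    """Draws from the operating system's CSPRNG and returns a URL-safe token."""
    return secrets.token_urlsafe(num_bytes)
```

Both functions return strings that look equally random to a reader, which is one reason a weakness of this kind can pass code review unnoticed.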
VexIR2Vec: An Architecture-Neutral Embedding Framework for Binary Similarity
VenkataKeerthy, S.; Banerjee, Soumya; Dey, Sayan; Andaluri, Yashas; PS, Raghul; Kalyanasundaram, Subrahmanyam; Pereira, Fernando Magno Quintão; Upadrasta, Ramakrishna
doi: 10.1145/3721481
Binary similarity involves determining whether two binary programs exhibit similar functionality, with applications in vulnerability detection, malware analysis, and copyright detection. However, variations in compiler settings, target architectures, and deliberate code obfuscations significantly complicate the similarity measurement by effectively altering the syntax, semantics, and structure of the underlying binary. To address these challenges, we propose VexIR2Vec, a robust, architecture-neutral approach based on VEX-IR to solve binary similarity tasks. VexIR2Vec consists of three key components: a peephole extractor, a normalization engine (VexINE), and an embedding model (VexNet). The process to build program embeddings starts with the extraction of sequences of basic blocks, or peepholes, from control-flow graphs via random walks, capturing structural information. These generated peepholes are then normalized using VexINE, which applies compiler-inspired transformations to reduce architectural and compiler-induced variations. Embeddings of peepholes are generated using representation learning techniques, avoiding Out-of-Vocabulary (OOV) issues. These embeddings are then fine-tuned with VexNet, a feed-forward Siamese network that maps functions into a high-dimensional space for diffing and searching tasks in an application-independent manner. We evaluate VexIR2Vec against five baselines (BinDiff, DeepBinDiff, SAFE, BinFinder, and histograms of opcodes) on a dataset comprising 2.7M functions and 15.5K binaries from 7 projects compiled across 12 compilers targeting x86 and ARM architectures. The experiments span four adversarial settings (cross-optimization, cross-compilation, cross-architecture, and obfuscations) that are typically exploited by malware and vulnerabilities. In diffing experiments, VexIR2Vec outperforms the nearest baseline in these four scenarios by 40%, 18%, 21%, and 60%, respectively. In the searching experiment, VexIR2Vec achieves a mean average precision of 0.76, outperforming the nearest baseline by 46%. Our framework is highly scalable and is built as a lightweight, multi-threaded, parallel library using only open source tools. VexIR2Vec is approximately 3.1 to 3.5 times faster than the closest baselines and orders of magnitude faster than other tools.
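The peephole extraction step can be pictured with a minimal sketch. It assumes a control-flow graph stored as a plain dictionary of basic-block identifiers; the function name, parameters, and example graph are hypothetical and are not taken from the VexIR2Vec implementation, which operates on VEX-IR.

```python
import random
from typing import Dict, List

# Hypothetical CFG encoding: basic-block id -> list of successor block ids.
CFG = Dict[str, List[str]]


def extract_peepholes(cfg: CFG, walk_len: int, walks_per_block: int,
                      seed: int = 0) -> List[List[str]]:
    """Collect block sequences ("peepholes") by random walks along CFG edges."""
    rng = random.Random(seed)
    peepholes: List[List[str]] = []
    for start in cfg:
        for _ in range(walks_per_block):
            walk = [start]
            # Follow successor edges until the walk is long enough or hits a sink.
            while len(walk) < walk_len and cfg.get(walk[-1]):
                walk.append(rng.choice(cfg[walk[-1]]))
            peepholes.append(walk)
    return peepholes


example_cfg = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"], "exit": []}
peepholes = extract_peepholes(example_cfg, walk_len=3, walks_per_block=2)
```

Because every step follows an existing edge, each peephole is a valid execution path fragment, which is how the walks carry structural information into the later embedding stages.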
µOpTime: Statically Reducing the Execution Time of Microbenchmark Suites Using Stability Metrics
Japke, Nils; Grambow, Martin; Laaber, Christoph; Bermbach, David
doi: 10.1145/3715322
Performance regressions have a tremendous impact on the quality of software. One way to catch regressions before they reach production is executing performance tests before deployment, e.g., using microbenchmarks, which measure performance at subroutine level. In projects with many microbenchmarks, this may take several hours due to repeated execution to get accurate results, disqualifying them from frequent use in CI/CD pipelines. We propose µOpTime, a static approach to reduce the execution time of microbenchmark suites by configuring the number of repetitions for each microbenchmark. Based on the results of a full, previous microbenchmark suite run, µOpTime determines the minimal number of (measurement) repetitions with statistical stability metrics that still lead to accurate results. We evaluate µOpTime with an experimental study on 14 open-source projects written in two programming languages and five stability metrics. Our results show that (i) µOpTime reduces the total suite execution time (measurement phase) by up to 95.83% (Go) and 94.17% (Java), (ii) the choice of stability metric depends on the project and programming language, (iii) microbenchmark warmup phases have to be considered for Java projects (potentially leading to higher reductions), and (iv) µOpTime can be used to reliably detect performance regressions in CI/CD pipelines.
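To make the idea of choosing repetitions from a previous full run concrete, here is a minimal sketch. It assumes the coefficient of variation as the stability metric and a fixed threshold; both are illustrative assumptions, since the abstract does not name the five metrics or their thresholds, and the function names are hypothetical.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Sample standard deviation relative to the mean (assumes a positive mean)."""
    return statistics.stdev(values) / statistics.mean(values)


def minimal_repetitions(measurements: Sequence[float], threshold: float = 0.01,
                        min_reps: int = 3) -> int:
    """Smallest prefix of a previous run whose CV is at or below the threshold.

    Falls back to the full number of repetitions if no prefix is stable.
    """
    for n in range(min_reps, len(measurements) + 1):
        if coefficient_of_variation(measurements[:n]) <= threshold:
            return n
    return len(measurements)
```

A benchmark whose early measurements already agree is assigned few repetitions, while a noisy one keeps its full count, so the time savings concentrate on the stable benchmarks.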
The Sustainability Face of Automated Program Repair Tools
doi: 10.1145/3744900
Automated program repair (APR) aims to automate the process of repairing software bugs in order to reduce the cost of maintaining software programs. While APR accuracy has significantly improved in recent years, its energy impact remains unstudied. The field of green software research aims to measure the energy consumption required to develop, maintain, and use software products. Our main goal is to define the foundation for measuring the energy consumption of the APR activity. We state that an environmentally sustainable (or green) APR tool achieves a good balance between the ability to correctly repair bugs and the amount of energy consumed during that process. We measure the energy consumption of 10 traditional APR tools for Java and 11 fine-tuned large language models (LLMs) trying to repair real bugs from Defects4J. The results of this study show a tradeoff between energy consumption and repairability. In particular, APR tools such as TBar and RepairLlama repair more bugs than other approaches at the expense of higher energy consumption. Other tools, such as SimFix and the LLM CodeT5-large, provide a good tradeoff between energy consumption and repairability. We also present guidelines consisting of a set of recommendations for developing greener APR.
Distinguishing GUI Component States for Blind Users Using Large Language Models
Zhang, Mengxi; Liu, Huaxiao; Du, Changhao; Wang, Tengmei; Li, Han; Huang, Pei; Chen, Chunyang
doi: 10.1145/3722106
Graphical User Interfaces (GUIs) serve as the primary medium for user interaction with mobile applications (apps). Within these GUIs, editable text views, buttons, and other visual elements exhibit different states following user actions. However, developers often present these states only in various colors without providing textual hints for blind users. This results in significant difficulties for blind users to discern the transitions in component states, thereby hindering their ability to proceed with subsequent actions. Traditional rule-based methods and attribute settings often struggle to adapt to diverse component styles and fail to address the component state changes influenced by context. Recently, pre-trained Large Language Models (LLMs) have demonstrated their generalization ability to various downstream tasks. In this work, we leverage LLMs and propose a tool called Component states distinguishing GPT (CasGPT) to automatically distinguish component states in GUIs and provide corresponding textual hints, thereby aiding blind users in app usage. Our experiments demonstrate that CasGPT is a lightweight approach capable of accurately distinguishing component states (accuracy = 86.5%). The usefulness of our method is validated through a user study, where participants expressed positive attitudes toward it. Also, we compare and find that our method outperforms other open source LLMs and different versions of GPT.
An Empirical Study of Code Simplification Methods in Code Intelligence Tasks
Shen, Zongwen; Li, Yuning; Ge, Jidong; Chen, Xiang; Li, Chuanyi; Huang, Liguo; Luo, Bin
doi: 10.1145/3720540
In recent years, pre-trained language models have seen significant success in natural language processing and have been increasingly applied to code-related tasks. Code intelligence tasks have shown promising performance with the support of code pre-trained language models. Pre-processing code simplification methods have been introduced to prune code tokens from the model’s input while maintaining task effectiveness. These methods improve the efficiency of code intelligence tasks while reducing computational costs. Post-prediction code simplification methods provide explanations for code intelligence task outcomes, enhancing the reliability and interpretability of model predictions. However, comprehensive evaluations of these methods across diverse code pre-trained model architectures and code intelligence tasks are lacking. To assess the effectiveness of code simplification methods, we conduct an empirical study integrating these code simplification methods with various pre-trained code models across multiple code intelligence tasks. Our empirical findings suggest that developing task-specific code simplification methods would be beneficial. Then, we recommend leveraging post-prediction methods to summarize prior knowledge, which can pre-process code simplification strategies. Moreover, establishing more evaluation mechanisms for code simplification is crucial. Finally, we propose incorporating code simplification methods into the pre-training phase of code pre-trained models to enhance their program comprehension and code representation capabilities.
Prompting Techniques for Secure Code Generation: A Systematic Investigation
Tony, Catherine; Díaz Ferreyra, Nicolás E.; Mutas, Markus; Dhif, Salem; Scandariato, Riccardo
doi: 10.1145/3722108
Large Language Models (LLMs) are gaining momentum in software development, with prompt-driven programming enabling developers to create code from Natural Language (NL) instructions. However, studies have questioned their ability to produce secure code and, thereby, the quality of prompt-generated software. In parallel, various prompting techniques that carefully tailor prompts have emerged to elicit optimal responses from LLMs. Still, the interplay between such prompting strategies and secure code generation remains under-explored and calls for further investigation. Objective: In this study, we investigate the impact of different prompting techniques on the security of code generated from NL instructions by LLMs. Method: First, we perform a systematic literature review to identify the existing prompting techniques that can be used for code generation tasks. A subset of these techniques is evaluated on GPT-3, GPT-3.5, and GPT-4 models for secure code generation. For this, we used an existing dataset consisting of 150 NL security-relevant code generation prompts. Results: Our work (i) classifies potential prompting techniques for code generation, (ii) adapts and evaluates a subset of the identified techniques for secure code generation tasks, and (iii) observes a reduction in security weaknesses across the tested LLMs, especially after using an existing technique called Recursive Criticism and Improvement (RCI), contributing valuable insights to the ongoing discourse on LLM-generated code security.
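The general shape of a critique-and-improve loop such as RCI can be sketched as follows. This is an assumption-laden illustration rather than the study's protocol: the `LLM` callable type, the prompt wording, and the number of rounds are all hypothetical, and no real model API is invoked.

```python
from typing import Callable

# Hypothetical interface: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]


def rci_generate(task: str, llm: LLM, rounds: int = 2) -> str:
    """Generate code, then repeatedly ask the model to critique and improve it."""
    code = llm(f"Write code for the following task:\n{task}")
    for _ in range(rounds):
        # Critique step: the model reviews its own previous output.
        critique = llm(f"Review this code and list any security problems:\n{code}")
        # Improvement step: the model revises the code using the critique.
        code = llm(
            "Improve the code based on the critique.\n"
            f"Critique:\n{critique}\nCode:\n{code}"
        )
    return code
```

Each round costs two additional model calls, so a loop of this kind trades latency and token usage for the chance to remove weaknesses from the first draft.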
Unraveling Code Clone Dynamics in Deep Learning Frameworks
Assi, Maram; Hassan, Safwat; Zou, Ying
doi: 10.1145/3721125
Deep Learning (DL) frameworks play a critical role in advancing AI, and their rapid growth underscores the need for a comprehensive understanding of software quality and maintainability. DL frameworks, like other systems, are prone to code clones. Code clones refer to identical or highly similar source code fragments within the same project or even across different projects. Code cloning can have positive and negative implications for software development, influencing maintenance, readability, and bug propagation. While the existing studies focus on studying clones in DL-based applications, to our knowledge, no work has been done investigating clones, their evolution, and their impact on the maintenance of DL frameworks. In this article, we aim to address the knowledge gap concerning the evolutionary dimension of code clones in DL frameworks and the extent of code reuse across these frameworks. We empirically analyze code clones in nine popular DL frameworks, i.e., TensorFlow, Paddle, PyTorch, Aesara, Ray, MXNet, Keras, Jax, and BentoML, to investigate (1) the characteristics of the long-term code cloning evolution over releases in each framework, (2) the short-term, i.e., within-release, code cloning patterns and their influence on the long-term trends, and (3) the file-level code clones within the DL frameworks. Our findings reveal that DL frameworks adopt four distinct cloning trends: “Serpentine,” “Rise and Fall,” “Decreasing,” and “Stable” and that these trends present some common and distinct characteristics. For instance, bug-fixing activities persistently happen in clones irrespective of the clone evolutionary trend but occur more in the “Serpentine” trend. Moreover, the within-release level investigation demonstrates that short-term code cloning practices impact long-term cloning trends. 
The cross-framework investigation reveals file-level code clones across the nine studied frameworks, of both the functional and the architectural adaptation kind. We provide insights that foster robust clone practices and collaborative maintenance in the development of DL frameworks.
SCOPE: Performance Testing for Serverless Computing
Wen, Jinfeng; Chen, Zhenpeng; Zhao, Jianshu; Sarro, Federica; Ping, Haodi; Zhang, Ying; Wang, Shangguang; Liu, Xuanzhe
doi: 10.1145/3717609
Serverless computing is a popular cloud computing paradigm that has found widespread adoption across various online workloads. It allows software engineers to develop cloud applications as a set of functions (called serverless functions). However, accurately measuring the performance (i.e., end-to-end response latency) of serverless functions is challenging due to the highly dynamic nature of the environment in which they run. To tackle this problem, a potential solution is to apply checks of performance testing techniques to determine how many repetitions of a given serverless function across a range of inputs are needed to cater to the performance fluctuation. However, the available literature lacks performance testing approaches designed explicitly for serverless computing. In this article, we propose the first serverless computing-oriented performance testing (SCOPE) approach. SCOPE takes into account the unique performance characteristics of serverless functions, such as their short execution durations and on-demand triggering. As such, SCOPE is designed as a fine-grained analysis approach. SCOPE incorporates the accuracy check and the consistency check to obtain the accurate and reliable performance of serverless functions. The evaluation shows that SCOPE provides testing results with 97.25% accuracy, 33.83 percentage points higher than the best currently available technique. Moreover, the superiority of SCOPE over the state-of-the-art holds on all functions that we study.