Universal Fuzzing via Large Language Models

Chunqiu Steven Xia; Matteo Paltenghi; Jia Le Tian; Michael Pradel; Lingming Zhang

doi:10.48550/arxiv.2308.04748

Xia, Chunqiu Steven;Paltenghi, Matteo;Tian, Jia Le;Pradel, Michael;Zhang, Lingming 2023-08-09 00:00:00 Chunqiu Steven Xia Matteo Paltenghi University of Illinois University of Urbana-Champaign, USA Stuttgart, Germany [email protected] [email protected] Jia Le Tian Michael Pradel Lingming Zhang University of Illinois University of University of Illinois Urbana-Champaign, USA Stuttgart, Germany Urbana-Champaign, USA [email protected] [email protected] [email protected] ABSTRACT in compilers and runtime engines is crucial because they can affect all corresponding downstream applications. Fuzzing has achieved tremendous success in discovering bugs and Traditional fuzzers can be categorized as generation-based [35, vulnerabilities in various software systems. Systems under test 49, 79] or mutation-based [22, 32, 67]. Generation-based fuzzers (SUTs) that take in programming or formal language as inputs, aim to directly synthesize complete code snippets, e.g., using a pre- e.g., compilers, runtime engines, constraint solvers, and software defined grammar for the target language. Instead of synthesizing libraries with accessible APIs, are especially important as they are from scratch, mutation-based fuzzers apply mutation operators or fundamental building blocks of software development. However, transformation rules to a set of high quality fuzzing seeds. Unfor- existing fuzzers for such systems often target a specific language, tunately, both traditional fuzzing approaches face the following and thus cannot be easily applied to other languages or even other limitations and challenges: versions of the same language. Moreover, the inputs generated C1: Tight coupling with target system and language. Traditional by existing fuzzers are often limited to specific features of the in- fuzzers are often designed to target a specific language or a par- put language, and thus can hardly reveal bugs related to other or ticular SUT. 
However, designing and implementing a fuzzer is new features. This paper presents Fuzz4All, the first fuzzer that extremely time-consuming. For example, Csmith [79], a fuzzer is universal in the sense that it can target many different input for C/C++ compilers, has more than 80K lines of code, while Syz- languages and many different features of these languages. The key kaller [68], a fuzzer for Linux system calls, contains tens of thou- idea behind Fuzz4All is to leverage large language models (LLMs) sands of handcrafted rules [10] to generate and modify system calls. as an input generation and mutation engine, which enables the Because each target language is different, it is often non-trivial to approach to produce diverse and realistic inputs for any practi- reuse the effort of implementing a fuzzer from one input language cally relevant language. To realize this potential, we present a novel to another. Furthermore, fuzzing strategies that work well for one autoprompting technique, which creates LLM prompts that are well- SUT may not work at all for another one. suited for fuzzing, and a novel LLM-powered fuzzing loop, which C2: Lack of support for evolution. Real-world systems are con- iteratively updates the prompt to create new fuzzing inputs. We stantly evolving, e.g., by adding new features to the input language. evaluate Fuzz4All on nine systems under test that take in six differ- Traditional fuzzers designed for a specific version of a language ent languages (C, C++, Go, SMT2, Java and Python) as inputs. The or SUT may lose their effectiveness on a new version and cannot evaluation shows, across all six languages, that universal fuzzing be easily used to test newly implemented features. For example, achieves higher coverage than existing, language-specific fuzzers. Csmith supports only a limited set of features up to C++11, while Furthermore, Fuzz4All has identified 76 bugs in widely used sys- the C++ language has evolved significantly since then. 
In fact, re- tems, such as GCC, Clang, Z3, CVC5, OpenJDK, and the Qiskit cent work [21] shows that over a six-month fuzzing period, Csmith quantum computing platform, with 47 bugs already confirmed by was not able to uncover any new bugs in the latest releases of developers as previously unknown. popular GCC and Clang compilers, showing that new versions of compilers are becoming immune to existing fuzzers. 1 INTRODUCTION C3: Restricted generation ability. Even within the scope of a spe- cific target language, both generation-based and mutation-based Fuzz testing [67, 82], also known as fuzzing, is an automated testing fuzzing often are unable to cover a large part the input space. approach for generating inputs designed to expose unexpected be- Generation-based fuzzers rely heavily on an input grammar to haviors, e.g., crashes, of a system under test (SUT). Researchers and synthesize valid code, and additionally are equipped with semantic practitioners have successfully built practical fuzzing tools, which rules that ensure the validity of the synthesized code. To generate have shown great success in finding numerous bugs and vulnera- a high amount of valid fuzzing inputs or to side-step difficult-to- bilities in real-world systems [6]. A particularly important family model language features, generation-based fuzzers often use a sub- of SUTs are systems that take in programming or formal language set of the full language grammar, which limits them to test only a inputs, e.g., compilers, runtime engines, constraint solvers, and subset of all language features. Similarly, mutation-based fuzzers literally any libraries with accessible APIs. Numerous fuzzers have are limited by their mutation operators and require high quality been proposed for such systems since they are the fundamental seeds that can be difficult to obtain. 
building blocks for software development [12], e.g., finding bugs arXiv:2308.04748v1 [cs.SE] 9 Aug 2023 Conference’17, July 2017, Washington, DC, USA Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang Our Work. We present Fuzz4All, the first fuzzer that is universal in ★ Universal fuzzing. We introduce a new dimension for fuzzing the sense that it can target many different input languages and many that directly leverages the multi-lingual capabilities of LLMs to different features of theses languages. Our approach fundamentally fuzz-test many SUTs with a wide range of meaningful inputs. differs from existing general-purpose fuzzers, e.g., AFL [50] and ★ Autoprompting for fuzzing. We present a novel autoprompt- libFuzzer [43], which use extremely simple mutations, are unaware ing stage to support both general and targeted fuzzing by auto- of the target language, and therefore struggle to produce meaningful matically distilling user inputs into a prompt that is effective at programming language fuzzing inputs. Instead, our key idea is to generating inputs to the SUT. leverage a large language model (LLM) as an input generation and ★ LLM-powered fuzzing loop. We present an algorithm that con- mutation engine. Because LLMs are pre-trained on large amounts tinuously generates new fuzzing inputs by iteratively modifying of examples in various programming languages and other formal the prompt with selected examples and generation strategies. languages, they come with an implicit understanding of the syntax ★ Evidence of real-world effectiveness . We show across six pop- and semantics of these languages. Fuzz4All leverages this ability by ular languages and nine real-world SUTs (e.g., GCC, CVC5, Go, using an LLM as a universal input generation and mutation engine. javac, and Qiskit) that our approach significantly improves cover- The input to Fuzz4All are user-provided documents describing age compared to state-of-the-art fuzzers (avg. 
36.8%) and detects the SUT, and optionally, specific features of the SUT to focus on, 76 bugs, with 47 already confirmed as previously unknown. e.g., in the form of documentation, example code, or formal specifi- ★ Continuous updating. We plan to continue to apply Fuzz4All cations. However, these user inputs may be too verbose to directly on additional targets and languages. Our code, dataset, and up- use as a prompt for the LLM. Instead of requiring the user to manu- to-date progress can be found at: https://fuzz4all.github.io ally engineer a prompt [47], which is time-consuming, we present an autoprompting step that automatically distills all user-provided 2 BACKGROUND & RELATED WORK inputs into a concise and effective prompt for fuzzing. This prompt 2.1 Large Language Models is the initial input to an LLM that generates fuzzing inputs. Since Recent developments in natural language processing (NLP) has continuously sampling with the same prompt would lead to many lead to the wide-spread adoption of large language models (LLMs) similar fuzzing inputs, we present an LLM-powered fuzzing loop, for both natural language [8] and code tasks [78]. State-of-the- which iteratively updates the prompt to generate a diverse set of art LLMs are based on transformers [71] and can be classified into fuzzing inputs. To this end, Fuzz4All combines fuzzing inputs gen- decoder-only (e.g., GPT3 [8] and StarCoder [41]), encoder-only (e.g., erated in previous iterations with natural language instructions, BERT [20] and CodeBERT [23]) and encoder-decoder (BART [40] e.g., asking to mutate these inputs. The LLM-generated fuzzing and CodeT5 [81]) models. More recently, instruction-based LLMs inputs are then passed to the SUT, which we validate against a (e.g., ChatGPT [63] and GPT4 [54]) and LLMs fine-tuned using re- user-provided test oracle, such as checking for system crashes. 
inforcement learning from human feedback (RLHF) [86] are shown Fuzz4All addresses the previously discussed limitations and to understand and follow complex instructions [4, 55, 63]. challenges of traditional fuzzers. Instead of meticulously designing LLMs are typically either fine-tuned [ 61] or prompted [47] to a single-purpose fuzzer for a specific SUT (C1), Fuzz4All, by using perform specific tasks. Fine-tuning updates the model weights an LLM as the generation engine, can be applied to a wide range of through further training on a task-specific dataset. However, suit- SUTs and input languages. Compared to existing fuzzers that target able datasets may be unavailable, and as LLM sizes continue to a specific version of the SUT or input language (C2), Fuzz4All grow [36], fine-tuning a large LLM is also increasingly expensive. can easily evolve with the target. For example, to fuzz-test a newly Prompting, on the other hand, does not require explicitly updating implemented feature, a user can simply provide documentation the model weights, but provides the LLM with a description of or example code related to that feature. To address the restricted the task, and optionally, a few examples of solving the task. The generation ability of traditional fuzzers (C3), Fuzz4All exploits the process of picking the input (i.e., prompt) is known as prompt en- fact that LLMs are pre-trained on billions of code snippets, enabling gineering [47], where a user tries different input instructions until them to create a wide range of examples that likely obey the syn- finding one that works well. Recently, researchers have proposed tactic and semantic constraints of the target language/SUT. Finally, autoprompting [66], an automatic process that uses LLM gradients Fuzz4All does not require any instrumentation of the SUT, making to select either soft prompts [42, 60], i.e., continuous vector embed- the approach easily applicable in practice. 
dings, or hard prompts [62, 69], i.e., natural language text. Even We perform an extensive evaluation on six input languages more recently, researchers have substituted gradient-based methods (C, C++, SMT, Go, Java, and Python) and nine SUTs. For each of by computing a proxy score of effectiveness [85]. them, we compare our approach against state-of-the-art generation- This work leverages LLMs for the important problem of fuzzing. based and mutation-based fuzzers. The results show that Fuzz4All Unlike traditional autoprompting and proxy-based approaches, our achieves the highest code coverage across all languages, improving autoprompting strategy directly synthesizes prompts using GPT4 the previous state-of-the-art coverage by 36.8%, on average. Ad- and scores them according to a fuzzing-specific goal. ditionally, we demonstrate that Fuzz4All supports both general fuzzing and fuzzing targeted at specific features of the SUT, which a 2.2 Fuzzing and Testing user decides upon by providing adequate input documents. Finally, Fuzz4All detects 76 bugs across our studied SUTs, with 47 already Fuzz testing aims to generate inputs that cause unexpected behav- confirmed by developers as previously unknown. iors of the SUT. Traditional fuzzers can be classified as generation- Contributions: This paper makes the following contributions: based [35, 49, 79] or mutation-based [22, 32, 67]. Generation-based Universal Fuzzing via Large Language Models Conference’17, July 2017, Washington, DC, USA fuzzers create complete code snippets using pre-defined grammars SUTs. Furthermore, unlike existing techniques, which produce gen- and built-in knowledge of the semantics of the target language. eral fuzzing inputs in a particular language, Fuzz4All additionally Csmith [79] and YARPGen [49] hard-code language specifications supports targeted fuzzing, which can generate code snippets that to ensure the validity of generated code snippets to test C and C++ focus on selected features. 
compilers, respectively. jsfunfuzz [35] combines a language gram- In addition to fuzzing, LLMs have also been applied to the re- mar with historical bug-triggering code snippets to generate new in- lated problem of unit-test generation [5, 39, 53, 64, 72, 80]. Code- puts to test JavaScript engines. Generation-based fuzzers have also Mosa [39] interleaves traditional search-based software testing been used to test OpenCL [44], the JVM [11], CUDA [34] and deep with querying Codex to generate new unit-tests whenever a cover- learning compilers [45]. Mutation-based fuzzers [67] iteratively age plateau is reached. TestPilot [64] prompts Codex with method perform transformations on seeds to generate new fuzzing inputs. source code and example usages to generate unit-tests and to fix In addition to basic mutations, researchers have developed com- incorrectly generated tests. In contrast to these LLM-based test gen- plex transformations targeted at ensuring type consistency [11, 57], erators, which require a specific type of input (e.g., function source adding historical bug-triggering code snippets [32, 84], and cover- code) and only work for unit testing [53, 64], by using our novel age feedback [3, 22, 46]. To benefit from both generation and muta- autoprompting stage, Fuzz4All can take inputs in arbitrary formats tion, many fuzzers use a combination of both approaches [12, 51]. for both general and targeted fuzzing. Furthermore, such unit-test Different from the above fuzzers, which target specific SUTs or generators often require manual work to check/complete the tests as languages, another line of research is on general-purpose fuzzing. even state-of-the-art LLMs [15, 63] cannot always produce reliable AFL [50] and libFuzzer [43] are general-purpose fuzzers that use oracle. Instead, Fuzz4All leverages widely-used fuzzing oracles, genetic algorithms with a fitness function to prioritize fuzzing such as crashes, and is fully automated. 
inputs for further mutations that achieve new coverage. These mutations are unaware of the SUT and focus on byte-level transfor- 3 FUZZ4ALL APPROACH mations. That is, when applied on SUTs that receive programming languages as input, general-purpose fuzzers are extremely unlikely We present Fuzz4All, a universal fuzzer that leverages LLMs to to produce valid inputs. Recent work [29] has instead added regular support both general and targeted fuzzing of any SUTs that take in expression-based mutation operators to match common program- programming language input. Figure 1 provides an overview of our ming statements (e.g., change + to -). The simplicity of these mu- approach. Fuzz4All first takes in arbitrary user input that describes tation operators limits the ability of such fuzzers at covering new the fuzzing inputs to be generated, e.g., documentation of the SUT, code, especially in more complex languages, such as C [22, 29]. Poly- example code snippets, or specifications. As the user input may Glot [14] is another language-agnostic fuzzer, which first parses be long, redundant, and partially irrelevant, the approach distills the seed programs into a uniform intermediate representation using it into a concise but informative prompt for fuzzing. To this end, a language-specific grammar and then uses a set of mutation oper- Fuzz4All performs an autoprompting step (Section 3.1) by using a ators to generate new programs. While promising, PolyGlot still large, state-of-the-art distillation LLM to sample multiple different uses a limited set of mutations and cannot achieve the same level of candidate prompts 1 . Each candidate prompt is passed on to the coverage as fuzzers that are designed for a particular language [22]. generation LLM to generate code snippets (i.e., fuzzing inputs) 2 . 
To complement traditional fuzzing techniques and apply fuzzing Fuzz4All then selects the prompt that produces the highest quality to emerging domains, learning-based fuzzers have been proposed. fuzzing inputs 3 . Prior learning-based techniques mainly focus on training a neural Fuzz4All builds on two models, a distillation LLM that reduces network to generate fuzzing inputs. TreeFuzz [58] parses the train- the given user input and a generation LLM that creates the fuzzing ing corpus into a tree structure and through tree traversal, learns a inputs, to balance the trade-off between the costs and benefits differ- probabilistic, generative model that synthesizes new fuzzing inputs. ent LLMs provide. Because the distillation LLM needs to understand Deep learning models have been used to fuzz PDF parsers [27], and distill arbitrary user input, we use a high-end, large founda- OpenCL [17], C [48], network protocols [83], and JavaScript [38]. tional model with strong natural language understanding abilities. Very recently, researchers have also directly leveraged LLMs for However, directly using such a large model for input generation fuzzing specific libraries. TitanFuzz [18] uses Codex [13] to gen- would be inefficient due to the high inference cost of autoregressive erate seed programs and InCoder [25] to perform template-based generation. Instead, to perform efficient fuzzing, Fuzz4All uses a mutation for fuzzing deep learning libraries [59, 70]. FuzzGPT [19] smaller model as the generation LLM. While our approach is general is another LLM-based deep learning library fuzzer, which leverages across any pairs of distillation and generation LLMs, we implement historical bug-triggering code snippets to either prompt or directly Fuzz4All with state-of-the-art GPT4 [54] and StarCoder [41]. fine-tune LLMs towards generating more unusual code snippets Using the best prompt selected via autoprompting as the initial for more effective fuzzing. 
input prompt for the generation LLM, we then move on to the Unlike prior learning- and LLM-based fuzzers, Fuzz4All is eas- fuzzing loop (Section 3.2), where Fuzz4All continuously samples ily applicable across many programming languages. Prior work the generation LLM to generate fuzzing inputs 4 . To avoid gener- trains language-specific models or requires language-specific pars- ating many similar fuzzing inputs, Fuzz4All continuously updates ing. Even recent LLM-based techniques [18, 19] are designed specif- the input prompt in each iteration. Specifically, the approach selects ically for deep learning libraries with hand-crafted prompts or a previously generated input as an example 5 , which demonstrates mutation patterns, and therefore cannot be easily extended to other the kind of future inputs we want the model to generate. In addi- tion to the example, Fuzz4All also appends a generation instruction Conference’17, July 2017, Washington, DC, USA Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang import ("fmt" "math/big") std::expected func main() { (theory Ints The class template std::expected provides operands []float64{2.6, a way to store either of two values. An 2.5} :funs ((NUMERAL Int) object of std::expected at any given time for mode big.ToNearestEven; (- Int Int) Member types Definition mode big.ToPositiveInf; mode (- Int Int Int :left-assoc) value_type(c++23) T { (+ Int Int Int :left-assoc) error_type(c++23) E fmt.Printf(" %s", mode) (* Int Int Int :left-assoc) ... documentation example code specification System Under Test sample int main(){ prompts distillation 4 sample std::expected std::expected std::variant int main(){ provides a way to provides a way to ... LLM std::expected store either a ... store either a ... ... best prompt input prompt fuzzing inputs generation 3 score & LLM std::expected select prompt 6 5 provides a way to std::expected store either a ... 
int main(){ int main(){ update select provides a way to std::variant std::variant std::expected int main(){ int main(){ input code store either a ... ... ... std::expected std::expected provides a way to prompt snippet ... ... store either a ... code snippets generate-new candidate prompts int main(){ std::expected ... mutate-existing 2 sample selected code generation semantic-equiv snippet LLM generation strategies Autoprompting Fuzzing Loop Figure 1: Overview of Fuzz4All. to use a distillation LLM to generate prompts that distill the infor- Algorithm 1: Autoprompting for fuzzing mation provided by the user, we give the following autoprompting 1 Function Autoprompting: Input : userInput, numSamples instruction to the distillation LLM: “Please summarize the above Output: inputPrompt information in a concise manner to describe the usage and function- 2 greedyPrompt← M (userInput, APInstruction, temp=0) ality of the target”. LetM be the distillation LLM, userInput be 3 candidatePrompts← [ greedyPrompt ] the user input and APInstruction be the autoprompting instruction. 4 while | candidatePrompts | < numSamples do 5 prompt← M (userInput, APInstruction, temp=1) The prompt prompt generated can be formalized as the conditional 6 candidatePrompts← candidatePrompts + [ prompt ] probability:M (prompt| userInput, APInstruction) 7 inputPrompt← argmax Scoring (M (p), SUT) Fuzz4All first generates a candidate prompt using greedy sam- p∈candidatePrompts pling with temperature 0 (line 2). By first sampling with low temper- 8 return inputPrompt ature, the algorithm obtains a plausible solution with a high degree of confidence. This approach is commonly used in other domains, to the initial prompt, which guides the model toward generating e.g., program synthesis [13], where the greedy output is evaluated new fuzzing inputs 6 . This process is repeated while continuously first to check if it can solve the problem. 
The algorithm then moves passing the generated fuzzing inputs into the SUT and checking on to sampling with higher temperature to obtain more diverse its behavior against a user-defined oracle, such as crashes. prompts (line 5), as done in prior work [13, 77]. Compared to greedy, sampling with high temperature yields different prompts that can 3.1 Autoprompting each provide a unique distilled summary of the user input. Each The following presents the details of the first of two main steps of generated prompt is added to a list of candidate prompts (line 6), Fuzz4All, which distills the given user input via autoprompting until the algorithm reaches the desired number of candidates. into a prompt suitable for fuzzing. The user input may describe the To pick the best input prompt to be used in the fuzzing step, SUT in general, or particular feature of the SUT to be tested. As the algorithm evaluates each candidate prompt by performing a shown in Figure 1, user inputs may include technical documenta- small-scale fuzzing experiment. Specifically, the approach uses each tion, example code, specifications, or even combinations of different prompt as an input to the generation LLM to produce multiple code modalities. Unlike traditional fuzzers that require inputs to follow snippets per prompt. Fuzz4All then scores the generated code snip- a specific format, e.g., code snippets to use as seeds or well-formed pets for each prompt based on a scoring function. While the scoring specifications, Fuzz4All can directly understand the natural lan- function can be based on a variety of different metrics, e.g., cover- guage descriptions or code examples in the user input. 
However, age, bug finding, or the complexity of generated fuzzing inputs, to some information in the user input may be redundant or irrelevant, make the approach lightweight and general, our scoring function is and hence, directly using the user inputs as a prompt for the gen- the number of unique generated code snippets that are valid, i.e., ac- eration LLM may be ineffective, as confirmed by our ablation study cepted by the target SUT. This metric is chosen since for fuzzing, we in Section 5.3. Therefore, the goal of autoprompting is to generate want fuzzing inputs to be valid or close to valid to trigger logic deep a distilled input prompt that enables effective LLM-based fuzzing. inside the SUT. LetM be the generation LLM, p be a candidate 3.1.1 Autoprompting Algorithm. Algorithm 1 details Fuzz4All’s prompt, isValid be the function that returns 1 if a generated code autoprompting step. The inputs are the user input and the number cis valid and 0 if invalid. Our default scoring function is defined of candidate prompts to generate. The final output is the input as: [isValid(c, SUT)]. Finally, Fuzz4All selects the input c∈M (p) prompt selected to be used for the fuzzing campaign. As our goal is user inputs ... Universal Fuzzing via Large Language Models Conference’17, July 2017, Washington, DC, USA The C++23 std::expected class template provides a way to store either an High level Algorithm 2: Fuzzing loop expected value of type T or an unexpected value of type E. It is useful for description handling functions that may return an error or a valid result. The stored value of feature is allocated directly within the storage occupied by the expected object, 1 Function FuzzingLoop: without dynamic memory allocation. Input : inputPrompt, timeBudget The template parameters are T (the expected value type) and E (the unexpected Descriptions Output: bugs value type). Both types must meet the Destructible requirements, and certain of the inputs types are not allowed. 
2 genStrats← [ generate-new, mutate-existing, std::expected provides member functions for construction, destruction, assignment, and accessing the stored values. Observers like operator bool and semantic-equiv ] has_value can be used to check if the object contains an expected value. 3 fuzzingInputs← M (inputPrompt + generate-new) Functions like value, error, and value_or can be used to access the expected G or unexpected values. 4 bugs← Oracle (fuzzingInputs, SUT) Monadic operations like and_then, transform, or_else, and transform_error 5 while timeElapsed < timeBudget do allow chaining operations on expected values and handling errors in a Different functional manner. 6 example← sample (fuzzingInputs, SUT) usages of Modifiers like emplace and swap can be used to construct the expected value 7 instruction← sample (genStrats) target in-place or exchange the contents of expected objects. Non-member functions like operator:= and swap(std::expected) provide comparison and swapping 8 fuzzingInputs← M (inputPrompt + example + functionality. instruction) Helper classes like unexpected, bad_expected_access, and unexpect_t are used to represent unexpected values, exceptions, and in-place construction tags for 9 bugs← bugs + Oracle (fuzzingInputs, SUT) unexpected values in expected objects. 10 return bugs Figure 2: Autoprompting result for std::expected. sampling multiple times using the same input would produce the prompt with the highest score (line 7) as the initial input prompt to same or similar code snippets. For fuzzing, we aim to avoid such re- be used for fuzzing. In summary, our autoprompting step combines peated inputs and instead want to generate a diverse set of fuzzing both prompt generation and scoring, which allows Fuzz4All to au- inputs that cover new code and discover new bugs. To accomplish tomatically generate/select a prompt suitable for the fuzzing target. this goal, we exploit the ability of LLMs to utilize both examples 3.1.2 Example: Autoprompting. 
Figure 2 shows an example of an input prompt generated by our autoprompting algorithm. The example is for fuzzing C++ compilers while focusing specifically on std::expected, a new feature introduced in C++23. As the user input, we pass the original cppreference documentation [2] to Fuzz4All, which spans multiple screen lengths with small tables and verbose descriptions (498 words, 3262 characters). In contrast, the distilled input prompt created by the autoprompting algorithm provides a more concise natural language description of the targeted feature (214 words, 1410 characters). The input prompt contains a high-level description of how std::expected is to be used. For example, the input prompt contains a concise sentence (highlighted in orange) that summarizes the situations the feature is useful in. Additionally, the input prompt contains descriptions of the inputs, as well as the different usages (i.e., member functions) of the feature. For example, functions and_then, transform, or_else, and transform_error have very similar descriptions in the original documentation, which are repeated for each function. Instead, in the distilled input prompt, these functions are grouped together in a concise manner that still illustrates how they can be used. Using the distilled input prompt, Fuzz4All can generate fuzzing inputs that effectively target the std::expected feature of C++ compilers.

3.1.3 Comparison with Existing Autoprompting Techniques. To the best of our knowledge, we are the first to automatically distill knowledge from arbitrary user inputs for a software engineering task using black-box autoprompting. Compared to prior work on autoprompting in NLP [66] and software engineering [73], which optimize the prompt by accessing model gradients, our autoprompting needs only black-box, sampling access to the distillation LLM. While the use of a scoring function to evaluate each prompt is similar to recent work in NLP [85], our scoring function directly evaluates the prompt on the exact downstream task of generating valid code snippets, instead of using an approximate proxy scoring function.

Table 1: SUTs and baseline tools.
Language | SUT(s) | Baseline tool(s) | Version
C | GCC, Clang | GrayC [22], Csmith [79] | GCC-13.1.1
C++ | G++, Clang++ | YARPGen [49] | G++-13.1.1
SMT2 | Z3, CVC5 | TypeFuzz [57] | CVC5-1.0.5
Go | Go | go-fuzz [26] | go-1.20.6
Java | javac | Hephaestus [11] | OpenJDK-javac-18
Python | Qiskit | MorphQ [56] | qiskit-0.43.1

4 EXPERIMENTAL DESIGN
We evaluate Fuzz4All on the following research questions:
• RQ1: How does Fuzz4All compare against existing fuzzers?
• RQ2: How effective is Fuzz4All in performing targeted fuzzing?
• RQ3: How do different components contribute to Fuzz4All's effectiveness?
• RQ4: What real-world bugs does Fuzz4All find?

4.1 Implementation
Fuzz4All is primarily implemented in Python. The autoprompting and fuzzing loop components of Fuzz4All contain only 872 LoC. Compared to traditional fuzzers, such as Csmith (>80K LoC), which need high manual effort to implement generators, Fuzz4All has a very lightweight implementation. Fuzz4All uses GPT4 [54] as the distillation LLM to perform autoprompting since this model is the state-of-the-art for a wide range of NLP-based reasoning tasks [9].

3.2 Fuzzing Loop
Given the input prompt created in the first step of Fuzz4All, the goal of the fuzzing loop is to generate diverse fuzzing inputs using a generation LLM. However, due to the probabilistic nature of LLMs, [...] and natural language instructions to guide the generation.

The high-level idea of the fuzzing loop is to continuously augment the original input prompt by selecting an example fuzzing input from previous iterations and by specifying a generation strategy. The goal of using an example is to demonstrate the kind of code snippet we want the generation LLM to produce. The generation strategies are designed as instructions on what to do with the provided code example. These strategies are inspired by traditional fuzzers, mimicking their ability to synthesize new fuzzing inputs (as in generation-based fuzzers) and to produce variants of previously generated inputs (as in mutation-based fuzzers). Before each new iteration of the fuzzing loop, Fuzz4All appends both an example and a generation strategy to the input prompt, enabling the generation LLM to continuously create new fuzzing inputs.

Figure 3: Fuzzing strategies and example of fuzzing loop.
Generation strategies and their instructions (top-right of the figure):
generate-new: Please create a program which uses complex {SMT2 logic} for an {SMT solver}
mutate-existing: Please create a mutated program that modifies the previous generation
semantic-equiv: Please create a semantically equivalent program to the previous generation
Example of the fuzzing loop. Every prompt starts with the initial prompt "SMT2 supports several theories, including integer and real arithmetic".
Step 1 (initial prompt + generate-new): (declare-const x1 Real) (assert (! (= x1 1))) (check-sat)
Step 2 (initial prompt + example + mutate-existing): (declare-const x1 Int) (assert (! (= x1 1))) (check-sat) (get-model)
Step 3 (initial prompt + example + semantic-equiv): (declare-const x1 Int) (assert (! (= x1 1) :named a)) (check-sat) (get-model)

3.2.1 Fuzzing Loop Algorithm. Algorithm 2 describes the fuzzing loop. The inputs are the initial input prompt and the fuzzing budget. The final output is a set of bugs identified by the user-defined oracle. First, the algorithm initializes the generation strategies (generate-new, mutate-existing, and semantic-equiv), which will be used to modify the input prompt during the fuzzing loop (line 2). Figure 3 (top-right) lists our three generation strategies along with their corresponding instructions. For the first invocation of the generation LLM, denoted with M, the algorithm does not yet have any examples of fuzzing inputs. Hence, it appends to the input prompt the generate-new generation instruction, which guides the model toward producing a first batch of fuzzing inputs (line 3).

Next, the algorithm enters the main fuzzing loop (lines 5–9), which continuously updates the prompt to create new fuzzing inputs. To this end, the algorithm selects an example from the previous batch of generated fuzzing inputs, randomly picking from all those fuzzing inputs that are valid for the SUT (line 6). In addition to the example, the algorithm also randomly picks one of the three generation strategies (line 7). The generation strategy either instructs the model to mutate the selected example (mutate-existing), to produce a fuzzing input that is semantically equivalent to the example (semantic-equiv), or to come up with a new fuzzing input (generate-new). The algorithm concatenates the initial input prompt, the selected example, and the selected generation strategy into a new prompt, and then queries the generation LLM with this prompt to produce another batch of fuzzing inputs (line 8).

The main fuzzing loop is repeated until the algorithm has exhausted the fuzzing budget. For each created fuzzing input, Fuzz4All passes the input to the SUT. If the user-defined oracle identifies an unexpected behavior, e.g., a crash, then the algorithm adds a report to the set of detected bugs (lines 4 and 9).
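The fuzzing loop of Algorithm 2 (Section 3.2.1) can be illustrated with a minimal Python sketch. It is not the Fuzz4All implementation: the callables `generate`, `is_valid`, and `oracle` are hypothetical stand-ins for the generation LLM, the SUT validity check, and the user-defined oracle; the budget is counted in generated inputs; the generic generate-new wording and the fallback used when a batch contains no valid input are assumptions.

```python
import random

# Instruction texts follow Figure 3. The generate-new wording is specialised
# per target in the paper, so the generic text below is an assumption.
STRATEGIES = {
    "generate-new": "Please create a program which uses complex features of the target",
    "mutate-existing": "Please create a mutated program that modifies the previous generation",
    "semantic-equiv": "Please create a semantically equivalent program to the previous generation",
}


def fuzzing_loop(input_prompt, budget, generate, is_valid, oracle, rng=None):
    """Sketch of Algorithm 2.

    generate(prompt) returns a batch (list of str); is_valid(code) and
    oracle(code) return bool. All three are hypothetical callables.
    The budget is the number of generated fuzzing inputs.
    """
    rng = rng or random.Random()
    bugs = []

    def run_batch(prompt):
        batch = generate(prompt)
        # Every generated input is checked by the oracle.
        bugs.extend(code for code in batch if oracle(code))
        return batch

    # First invocation: no example exists yet, so only generate-new is used.
    batch = run_batch(f"{input_prompt}\n{STRATEGIES['generate-new']}")
    used = max(1, len(batch))

    while used < budget:
        valid = [code for code in batch if is_valid(code)]
        if valid:
            example = rng.choice(valid)               # random valid example
            strategy = rng.choice(sorted(STRATEGIES)) # random strategy
            prompt = f"{input_prompt}\n{example}\n{STRATEGIES[strategy]}"
        else:
            # Assumption: without a valid example, fall back to generate-new.
            prompt = f"{input_prompt}\n{STRATEGIES['generate-new']}"
        batch = run_batch(prompt)
        used += max(1, len(batch))  # guard against empty batches
    return bugs
```

Passing the three callables in keeps the loop independent of any particular model or SUT, which mirrors the separation between the generation LLM and the user-defined oracle described above.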
3.2.2 Example: Fuzzing Loop. Figure 3 illustrates how our fuzzing loop uses input examples and the generation strategies to create different fuzzing inputs. In this case, we are fuzzing an SMT solver where the inputs are logic formulas written in the SMT2 language. Initially (step 1), there are no examples, and hence, the algorithm uses the generate-new strategy to synthesize new fuzzing inputs. Next, taking a generated, valid fuzzing input as an example, the algorithm queries the model to create a new input (step 2) based on the mutate-existing strategy, which aims to mutate the selected example. We observe that the new fuzzing input subtly modifies the previous input by swapping the type of a variable as well as adding some computation. In the next fuzzing iteration (step 3), the algorithm selects the previously generated fuzzing input as the example and uses the semantic-equiv generation strategy, which aims to create an input that does not modify the semantics of the given example. This time, we observe that the new fuzzing input simply adds a syntax tag to the selected example. In fact, the combination of generation strategies shown in the example helps Fuzz4All to generate a fuzzing input that causes an unexpected crash in the SMT solver. The crash exposes one of the real-world bugs detected by Fuzz4All during our evaluation, which has been confirmed and fixed by developers.

3.2.3 Oracle. The fuzzing inputs produced by Fuzz4All during the fuzzing loop can be used to check the behavior of the SUT against an oracle to detect bugs. The oracle is custom for each SUT, and it can be fully defined and customized by the user. For example, when fuzzing C compilers, a user could define a differential testing oracle that compares the compiler behavior under different optimization levels [79]. In this paper, we focus on simple and easy-to-define oracles, such as crashes due to segmentation faults and internal assertion failures, with more details discussed in Section 4.2.

For the distillation LLM, we use the gpt-4-0613 checkpoint with max_token of 500, provided via the OpenAI API [28]. For autoprompting, we sample four candidate prompts, generate 30 fuzzing inputs each, and evaluate using a scoring function based on validity rate (as described in Section 3.1.1). For the fuzzing loop, we use the Hugging Face implementation of the StarCoder [41] model as the generation LLM, which is trained on over one trillion code tokens across over 80 languages. Our default setting when generating fuzzing inputs uses a temperature of 1, a batch size of 30, and a maximum output length of 1,024, using nucleus sampling [33] with a top-p of 1.

4.2 Systems Under Test and Baselines
To demonstrate the generality of Fuzz4All, we evaluate it on six input languages and nine SUTs. Table 1 shows each of the languages, SUTs, and the corresponding baseline tools. Note that we compare coverage on one SUT per language, with the SUT versions used for coverage measurements shown in the last column of Table 1. Except for the coverage experiments, we perform fuzzing on the nightly release of each target. Unless otherwise mentioned, we use unexpected compiler crashes as the oracle and consider a fuzzing input as valid if it compiles successfully. Each baseline fuzzer is run with its default settings. For baseline fuzzers that require input seeds, we use the default seed corpus provided in their replication repository. We now present more evaluation details for each SUT.

4.2.1 C/C++ Compilers. We target the popular GCC and Clang compilers and provide the standard C library documentation as user input to Fuzz4All by default. Our baselines include Csmith [79], a classic generation-based C compiler fuzzer, and GrayC [22], a recent mutation-based fuzzer that uses coverage feedback together with specialized mutation operators. For C++, we target new C++23 features by providing the C++23 standard documentation as input to Fuzz4All.
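The default criteria of Section 4.2 (a compiler crash counts as a bug, a successful compilation counts as a valid input) could be checked along the following lines. This is a sketch under stated assumptions, not the harness used in the evaluation: the function names, the `gcc -c -o /dev/null` command line, and the choice of stderr markers are illustrative, and the subprocess part assumes a POSIX system with the compiler installed.

```python
import subprocess
import tempfile
from pathlib import Path

# Assumption: stderr markers that indicate a compiler crash. GCC reports
# "internal compiler error"; other SUTs would need their own markers.
CRASH_MARKERS = ("internal compiler error",)


def classify(returncode, stderr):
    """Map one compiler run to 'crash', 'valid', or 'invalid'."""
    # On POSIX, a negative return code means the process died from a signal.
    if returncode < 0 or any(marker in stderr for marker in CRASH_MARKERS):
        return "crash"
    return "valid" if returncode == 0 else "invalid"


def check_input(source, compiler=("gcc", "-c", "-o", "/dev/null"),
                suffix=".c", timeout=60):
    """Compile one fuzzing input and classify the outcome.

    The default command line is an illustrative assumption. A hang raises
    subprocess.TimeoutExpired, which callers may want to handle separately.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / f"input{suffix}"
        path.write_text(source)
        proc = subprocess.run([*compiler, str(path)], capture_output=True,
                              text=True, timeout=timeout)
    return classify(proc.returncode, proc.stderr)
```

Keeping `classify` separate from the process handling makes the decision rule easy to adapt, for example when a different SUT signals crashes through other messages.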
The C++ baseline is YARPGen [49], a generation-based fuzzer that extends Csmith with new language features in C++ and generation policies to trigger different compiler optimizations.

4.2.2 SMT Solvers. We run Fuzz4All on Z3 and CVC5 with commonly enabled developer settings, such as debug and assertion, following prior work [57, 75, 76]. Fuzz4All generates SMT formulas as fuzzing inputs using an overview documentation of the SMT2 language and SMT solver as input by default. A fuzzing input is considered valid if the SMT solver returns either SAT or UNSAT without any error. Our baseline is state-of-the-art TypeFuzz [57], which mutates existing SMT expressions based on newly generated expressions of the same type.

4.2.3 Go Toolchain. We run Fuzz4All on the most recent version of Go. By default, we use the Go standard library documentation as input to Fuzz4All. As a baseline, we use go-fuzz [26], a coverage-guided, mutation-based fuzzer designed for Go, which generates inputs for various Go standard libraries using handwritten templates.

4.2.4 Java Compiler. We evaluate Fuzz4All on the OpenJDK Java compiler, javac, which compiles source code into bytecode. Our default input is the latest standard Java API documentation page. We compare against Hephaestus [11], a recent combined generation- and mutation-based fuzzer designed for JVM compilers and targeting type-related bugs.

4.2.5 Quantum Computing Platform. We target Qiskit [1], a popular quantum computing framework [24]. Qiskit is built on top of Python, i.e., both the input program and the compilation are defined in Python code. Thus, creating a valid input for Qiskit means using the Qiskit Python APIs in a meaningful way, e.g., to create a quantum circuit. It is challenging for traditional synthesis tools to handle dynamically typed general-purpose languages (like Python) [30, 65], not to mention the additional API constraints, making fuzzing Qiskit a particularly difficult challenge. Our baseline is MorphQ [56], a recent fuzzer that uses a template- and grammar-based approach to generate valid quantum programs and then applies metamorphic transformations.

Unlike for the other SUTs, which receive fuzzing inputs in a file, to invoke Qiskit, we must run the generated Python program itself. As an oracle, we add statements at the end of the generated Python file, which collect all QuantumCircuit objects via Python's built-in introspection APIs and then apply two oracles on each circuit. The two oracles are directly borrowed from previous work for a fair comparison [56]. The first oracle compiles the circuit via a transpile call with different optimization levels and reports any crash. The second oracle converts the circuit to its lower-level QASM [16] representation and then reads it back, reporting any crash.

4.3 Experimental Setup and Metrics
Fuzzing campaigns. For RQ1, we use a fuzzing budget of 24 hours (including autoprompting), which is used commonly in prior work [37]. To account for variance, we repeat the experiment for both Fuzz4All and the baselines five times. Due to the high cost of experiments, for later RQs, we use a fuzzing budget of 10,000 generated fuzzing inputs and repeat four times for the ablation study.

Environment. Experiments are conducted on a 64-core workstation with 256 GB RAM running Ubuntu 20.04.5 LTS with 4 NVIDIA RTX A6000 GPUs (only one GPU is used per fuzzing run).

Metrics. We use the widely adopted measure of code coverage for evaluating fuzzing tools [7, 37, 74]. To be uniform, we report the line coverage for each of the targets studied in the evaluation. Following prior work [37], we use the Mann-Whitney U-test [52] to compute statistical significance and indicate significant (p < 0.05) coverage results in applicable tables (Tables 2 and 4) with *. We additionally measure the validity rate (% valid) of inputs as the percentage of fuzzing inputs generated that are valid and unique. As Fuzz4All supports both general and targeted fuzzing, to assess the effectiveness of targeted fuzzing, we report the hit rate, i.e., the percentage of fuzzing inputs that use a specific target feature (checked with simple regular expressions). Finally, we also report the most important metric and goal of fuzzing: the number of bugs detected by Fuzz4All for each of our nine SUTs.

5 RESULTS
5.1 RQ1: Comparison against Existing Fuzzers
5.1.1 Coverage over Time. Figure 4 shows the 24-hour coverage trend of Fuzz4All compared with the baselines, where the solid line shows average coverage and the area indicates the minimum and maximum across five runs. We observe that Fuzz4All achieves the highest coverage by the end of the fuzzing campaign across all targets, with an average improvement of 36.8% compared to the top performing baselines. Contrasting with generation-based fuzzers (i.e., YARPGen and MorphQ), Fuzz4All is able to almost immediately achieve higher coverage, demonstrating the powerful generative ability of LLMs in producing diverse code snippets compared to traditional program generation techniques. While mutation-based fuzzers (i.e., go-fuzz and GrayC) are able to achieve higher coverage in the beginning through the use of high quality seeds, the coverage gained via mutations rapidly falls off and Fuzz4All is able to slowly but surely cover more code. Note that we include the autoprompting time as part of the fuzzing budget for a fair comparison, which incurs negligible overhead (avg. 2.3 minutes per fuzzing campaign).

Figure 4: Coverage trend of Fuzz4All against state-of-the-art fuzzers in a 24-hour fuzzing campaign. Panels: (a) GCC (GrayC, Csmith, Fuzz4All), (b) G++ (YARPGen, Fuzz4All), (c) CVC5 (TypeFuzz, Fuzz4All), (d) Go (go-fuzz, Fuzz4All), (e) javac (Hephaestus, Fuzz4All), (f) Qiskit (MorphQ, Fuzz4All). Some panels also mark seed coverage. x-axis: Hours (0 to 24); y-axis: Coverage (#K lines).

Unlike the baseline fuzzers, which reach a coverage plateau by the end of the 24-hour period, Fuzz4All keeps finding inputs that cover new code, even near the end of the fuzzing campaign. Recall that during each iteration of Fuzz4All's fuzzing loop, the original input prompt is updated with both a new example and a generation strategy (Section 3.2), nudging the LLM to generate new fuzzing inputs. We hypothesize that this allows Fuzz4All to effectively generate new and diverse fuzzing inputs even after a long period of fuzzing, leading to sustained coverage increase.

5.1.2 Generation Validity, Number, and Coverage. We examine the number of fuzzing inputs generated and their validity rate across our studied SUTs. In Table 2, Column "# programs" represents the number of unique inputs generated, "% valid" is the percentage of fuzzing inputs that are valid, and "Coverage" shows the final coverage obtained by each fuzzer along with the relative improvement over the best baseline. We first observe that almost all traditional fuzzing tools can achieve a very high validity rate apart from Hephaestus, which purposefully generates invalid code (focused on incorrect types) to check for miscompilation bugs. In contrast, Fuzz4All has a lower percentage of valid fuzzing inputs generated (56.0% average reduction compared to baseline tools). Furthermore, the raw number of fuzzing inputs generated by baseline tools is also much higher. By using an LLM as the generation engine, Fuzz4All is bottlenecked by GPU inference, leading to 43.0% fewer fuzzing inputs compared to traditional fuzzers.

Table 2: Fuzz4All against state-of-the-art fuzzers (* indicates statistically significant coverage improvement).
Target | Fuzzer | # programs | % valid | Coverage
GCC | GrayC | 104,326 | 95.96% | 167,453
GCC | Csmith | 61,883 | 99.99% | 111,668
GCC | Fuzz4All | 44,324 | 37.26% | *198,927 (+18.8%)
G++ | YARPGen | 255,581 | 99.99% | 166,614
G++ | Fuzz4All | 26,365 | 40.74% | *210,743 (+26.5%)
CVC5 | TypeFuzz | 43,001 | 93.24% | 46,174
CVC5 | Fuzz4All | 36,054 | 47.63% | *57,674 (+24.9%)
Go | go-fuzz | 20,002 | 100.00% | 38,024
Go | Fuzz4All | 22,817 | 23.02% | *43,317 (+13.7%)
javac | Hephaestus | 728,217 | 57.22% | 10,285
javac | Fuzz4All | 31,967 | 49.05% | *16,552 (+60.9%)
Qiskit | MorphQ | 38,474 | 100.00% | 19,929
Qiskit | Fuzz4All | 33,454 | 24.90% | *34,988 (+75.6%)

In spite of the lower validity rate and number of fuzzing inputs, Fuzz4All generates much more diverse programs compared to traditional fuzzing tools, as evidenced by the high coverage obtained (+36.8% average increase). Additionally, even invalid code snippets that are close to valid can be useful for fuzzing, as they allow for finding bugs in the validation logic of the SUT. In Section 5.4, we further describe the various types of bugs detected by Fuzz4All, with both valid and invalid code snippets, to additionally showcase the benefit of generating diverse fuzzing inputs.

We note that Fuzz4All achieves a wide range of validity rates and numbers of fuzzing inputs across different SUTs. The number of fuzzing inputs varies across targets due to the varying cost to invoke the SUT after each fuzzing iteration for bug detection. Regarding validity rate, a general-purpose programming language, such as C, has a relatively lower validity rate compared to domain-specific languages, such as the SMT2 language used for SMT solvers. A more rigorous language, e.g., Go, which does not allow any declared but unused variables, has an even lower validity rate. We also observe a low validity rate for fuzzing quantum computing platforms. As quantum computing is an emerging area with its own set of library APIs, the generation LLM may not have seen as many examples of quantum programs during its training as for more established languages. Nevertheless, Fuzz4All is still able to leverage user-provided documentation to generate interesting fuzzing inputs, which leverage quantum library APIs and achieve an impressive coverage improvement (+75.6%) compared to the state-of-the-art fuzzer.

5.2 RQ2: Effectiveness of Targeted Fuzzing
We now evaluate the ability of Fuzz4All to perform targeted fuzzing, i.e., to generate fuzzing inputs that focus on a particular feature. For each target SUT and language, we test by targeting three different example features and compare them to the setup with general user input, as used for RQ1 (described in Section 4.3). These features are built-in libraries or functions/APIs (Go, C++ and Qiskit), language keywords (C and Java), and theories (SMT). The user input for the targeted fuzzing runs is documentation of the particular feature we are focusing on. Table 3 shows the results of targeted fuzzing as well as the default general fuzzing used in RQ1. Each column represents a targeted fuzzing run where we focus on one feature. The value in each cell shows the hit rate of the feature (Section 4.3) for a particular fuzzing run. We also include the coverage results obtained.

Table 3: Hit rate and coverage during targeted fuzzing. Columns are fuzzing campaigns; rows give the hit rate of each feature and the coverage.

C targeted campaign (keywords)
Feature | typedef | union | goto | General
typedef | 83.11% | 47.16% | 0.48% | 4.38%
union | 10.80% | 80.43% | 0.10% | 0.32%
goto | 0.22% | 0.11% | 77.62% | 1.16%
Coverage | 123,226 | 125,041 | 120,452 | 188,148

C++ targeted campaign (built-in functions)
Feature | apply | expected | variant | General
apply | 70.23% | 0.41% | 0.68% | 0.32%
expected | 0.26% | 79.72% | 0.94% | 1.33%
variant | 1.16% | 5.98% | 93.19% | 3.63%
Coverage | 182,261 | 175,963 | 182,333 | 193,254

SMT targeted campaign (theories)
Feature | Array | BitVec | Real | General
Array | 82.23% | 2.08% | 1.44% | 11.07%
BitVec | 2.57% | 88.48% | 0.86% | 5.46%
Real | 1.45% | 0.17% | 96.01% | 17.36%
Coverage | 46,392 | 48,841 | 47,619 | 52,449

Go targeted campaign (built-in libraries)
Feature | atomic | big | heap | General
atomic | 90.09% | 0.04% | 0.06% | 1.01%
big | 0.18% | 97.20% | 0.23% | 3.63%
heap | 0.30% | 0.04% | 91.18% | 2.22%
Coverage | 10,156 | 12,986 | 9,790 | 37,561

Java targeted campaign (keywords)
Feature | instanceof | synchronized | finally | General
instanceof | 88.00% | 0.08% | 0.85% | 1.86%
synchronized | 0.16% | 94.80% | 0.16% | 0.85%
finally | 0.51% | 3.17% | 78.62% | 0.82%
Coverage | 14,546 | 13,972 | 13,203 | 16,128

Qiskit targeted campaign (APIs)
Feature | switch | for loop | linear | General
switch | 71.76% | 0.00% | 0.00% | 0.00%
for loop | 0.17% | 75.97% | 0.00% | 0.00%
linear | 0.00% | 0.00% | 54.79% | 0.00%
Coverage | 30,597 | 26,703 | 29,535 | 33,853

We observe that targeting a specific feature yields a high amount of fuzzing inputs that directly use the feature, with an average hit rate of 83.0%. This result demonstrates that Fuzz4All indeed performs targeted fuzzing by prompting the generation LLM with an input prompt that describes a particular feature. Furthermore, we observe that fuzzing on features that are related can lead to a moderately high cross-feature hit rate (i.e., hit rate of feature X on fuzzing run for feature Y). For example, the C keywords typedef and union are both related to type operations, and hence, their cross-feature hit rate is high compared to an unrelated feature, such as goto. As shown in Table 3, a general fuzzing approach, while achieving the highest overall code coverage, can be extremely inefficient in targeting a specific feature (average 96.0% reduction in hit rate compared with Fuzz4All's targeted fuzzing). For example, in Qiskit, the general fuzzing campaign has a 0% hit rate of the three target features. This can be explained by the fact that these features were added recently to Qiskit and are not yet widely used, thus being extremely rare in the LLM training data. However, by providing suitable user input during the targeted fuzzing campaign, Fuzz4All can successfully generate fuzzing inputs that use these new features. This ability of Fuzz4All will be valuable to developers who want to test novel features or components of a SUT.

5.3 RQ3: Ablation Study
To study how each component of Fuzz4All contributes to the overall fuzzing effectiveness, we conduct an ablation study based on the two key components of Fuzz4All: (a) Autoprompting, the type of initial input prompt provided to the generation LLM; (b) Fuzzing loop, the use of selected examples and generation strategies. We study three variants for each of the two key components. Table 4 shows the coverage and validity rate of our studied variants.

5.3.1 Autoprompting. First, we examine the effect of different initial inputs provided to the generation LLM. To reduce the impact of additional factors, we fix the generation strategy to only use generate-new and study three variants: 1) no input: does not use any initial prompt, 2) raw prompt: directly uses the raw user input as the initial prompt, 3) autoprompt: applies autoprompting to generate the initial prompt. We observe that across all studied languages, the no input variant achieves the lowest coverage. In no input, we do not provide any initial prompt, which would otherwise supply useful information on the features we want to generate fuzzing inputs for. As such, the LLM can only generate simple code snippets with high validity rate but is less effective in covering the SUT. We observe a coverage boost as we use the raw prompt variant, where we provide the raw documentation as the initial prompt. However, we can further improve both the code coverage and the validity rate by using our autoprompting stage to distill the user input into a concise but informative prompt (autoprompt), instead of using the raw user input. Directly using the user-provided input may include information that is irrelevant for fuzzing, leading to both a lower validity rate (as the generation LLM may struggle to understand the raw documentation) and lower coverage (since, unlike our autoprompting generated prompt, the raw documentation is not designed to be used for LLM generation).

5.3.2 Fuzzing loop. Next, we examine the different variants of our fuzzing loop setup by keeping the initial prompt the same (by using the default autoprompting): 1) w/o example: does not select an example during the fuzzing loop (i.e., it continuously samples from the same initial prompt), 2) w/ example: selects an example but only uses the generate-new instruction, 3) Fuzz4All: the full approach with all generation strategies used. We first observe that by only sampling from the same input (w/o example), LLMs will often repeatedly generate the same or similar fuzzing inputs. On average, 8.0% of the fuzzing inputs generated are repeated in w/o example compared to only 4.7% when using the full Fuzz4All approach. Adding an example to the input prompt (w/ example) avoids sampling from the same distribution and improves both coverage and validity rate. Finally, the full Fuzz4All approach achieves the highest coverage across all SUTs. Compared to the w/ example variant (the second-best), the full Fuzz4All adds additional generation strategies, semantic-equiv and mutate-existing, which help to further provide useful instructions to the generation LLM.

Table 4: Effectiveness of variants (* indicates statistically significant coverage improvement compared w/ 2nd best variant).
Component | Variant | Description | C Cov. | C % valid | C++ Cov. | C++ % valid | SMT Cov. | SMT % valid | Go Cov. | Go % valid | Java Cov. | Java % valid | Qiskit Cov. | Qiskit % valid
Autoprompting | no input | no initial prompt | 127,261 | 42.57% | 181,493 | 51.63% | 50,838 | 49.49% | 35,765 | 39.54% | 14,374 | 50.25% | 31,701 | 34.63%
Autoprompting | raw prompt | use user-provided input | 137,204 | 33.95% | 189,030 | 33.79% | 49,697 | 39.49% | 36,168 | 16.84% | 15,445 | 37.64% | 31,922 | 22.74%
Autoprompting | autoprompt | apply autoprompting | 182,530 | 39.09% | 190,318 | 36.62% | 51,496 | 45.04% | 36,732 | 24.87% | 15,838 | 45.54% | 32,691 | 29.12%
Fuzzing loop | w/o example | generate-new w/o example | 143,349 | 34.23% | 190,288 | 28.25% | 50,089 | 18.41% | 35,839 | 19.38% | 15,444 | 44.69% | 32,663 | 24.04%
Fuzzing loop | w/ example | generate-new w/ example | 182,530 | 39.09% | 190,318 | 36.62% | 51,496 | 45.04% | 36,732 | 24.87% | 15,838 | 45.54% | 32,691 | 29.12%
Fuzzing loop | Fuzz4All | all strategies w/ example | 185,491 | 40.58% | *193,845 | 41.22% | *53,069 | 50.06% | *37,981 | 32.00% | *16,209 | 50.99% | *33,913 | 27.45%

Footnote 1: The impact of additional generation strategies can be found in Section 5.3.2.
Footnote 2: Note that autoprompt and w/ example are the same variant, but we include them separately for ease of comparison.

Table 5: Summary of Fuzz4All-detected bugs.
SUT | Total | Confirmed (Unknown) | Confirmed (Known) | Pending | Won't fix
GCC | 22 | 10 | 6 | 6 | 0
Clang | 20 | 13 | 7 | 0 | 0
CVC5 | 6 | 4 | 2 | 0 | 0
Z3 | 12 | 10 | 0 | 0 | 2
Go | 4 | 2 | 2 | 0 | 0
Java | 1 | 0 | 0 | 1 | 0
Qiskit | 11 | 8 | 2 | 1 | 0
Total | 76 | 47 | 19 | 8 | 2

Figure 5: Exemplary bugs found by Fuzz4All.

(a) GCC bug: Internal compiler error (segmentation fault)
    #include <optional>
    void y(std::optional<int> z) noexcept(noexcept(std::optional<int>{z})) {}

(b) Clang bug: Segmentation fault
    #include <iostream>
    using E = std::numeric_limits<int>;
    auto fail(E e) -> decltype(throw e, void()) { throw e; }

(c) Go bug: Segmentation violation
    package main
    import ("runtime")
    func main() { runtime.ReadMemStats(nil) }

(d) Qiskit bug: Crash
    from qiskit import QuantumCircuit, ClassicalRegister
    crz = ClassicalRegister(1, name="crz")
    qc = QuantumCircuit(crz)
    qc.qasm(filename="my.qasm")
    QuantumCircuit.from_qasm_file("my.qasm")

5.4 RQ4: Bug Finding
Table 5 summarizes the bugs found by Fuzz4All on our nine studied SUTs. In total, Fuzz4All detects 76 bugs, with 47 bugs already confirmed by developers as previously unknown. These results not only demonstrate the practical effectiveness of Fuzz4All in finding large amounts of bugs but also the promised generality of Fuzz4All across languages and SUTs.

5.4.1 Examples. Figure 5a shows a bug found in GCC when using noexcept(x), a C++ feature that specifies a function is non-throwing if x evaluates to true. In this example bug, Fuzz4All generates rather complex code using std::optional, which indicates that a particular value may or may not be present at runtime. While this code is valid and should compile correctly, this combination of difficult runtime dependencies causes GCC to crash with an internal compiler error. We note that this bug cannot be found by prior techniques since they simply do not support the noexcept feature. The developers have already confirmed and fixed this bug. Interestingly, they even added a slightly modified version of our submitted code snippet to the official test suite of GCC.

Figure 5b shows a bug found in Clang, where the invalid code leads to a segmentation fault. Fuzz4All uses an unusual syntax for function declaration (i.e., auto x (...) -> return_type), which makes use of the decltype operation in C++. However, the bug occurs when the throw statement inside of the decltype is evaluated first, skipping the evaluation of the return type since throw exits the scope early and crashes Clang. This code, while invalid, is still useful to reveal a bug in the Clang frontend as confirmed by developers. Additionally, prior fuzzing tools can hardly find this bug since they typically focus on generating valid code only and do not handle the especially difficult-to-model decltype function.

Figure 5c shows a bug found in Go where a nil input causes a segmentation fault instead of producing a useful failure message. This bug is found by targeting the runtime Go standard library, where we provide the documentation, which includes the description of the ReadMemStats function. The bug has been confirmed and fixed by the developers. While this bug might look simple (invoking a single function), it cannot be found by the go-fuzz baseline simply because go-fuzz requires manually written templates to target specific libraries, and runtime is not a part of any such template. With Fuzz4All, users can directly target any Go standard libraries by providing relevant input information (e.g., documentation).

Figure 5d shows a bug found in Qiskit's QASM exporter. A quantum program, represented by the qc variable, is exported to QASM, a low level representation, silently generating an invalid output file, which leads to a crash when being reimported. The problem is that the exporter represents the register in QASM using its name as identifier, i.e., "crz", which also is the name of a well-known operation of the QASM language, thus making the generated code ambiguous. Note that prior work [56] could not find this bug because they use pre-defined templates with only anonymous registers, whereas Fuzz4All effectively leverages the quantum knowledge of LLMs to inject a meaningful string literal for detecting this bug.

6 THREATS TO VALIDITY
Internal. The main internal threat comes from the implementation of Fuzz4All. To address this, we performed code reviews and testing to ensure correctness. Furthermore, we run each baseline from their provided replication package whenever possible.

External. The main external threat is our evaluation targets. To support our generality claim, we apply Fuzz4All on nine different SUTs across six languages. Additionally, to account for variance
in long fuzzing runs, we repeat the 24-hour fuzzing campaign five times and check for statistically significant results. Since the generation LLM leverages the knowledge acquired during its training done within the last year, reapplying Fuzz4All using the exact checkpoint of the LLM (StarCoder) used in this work might degrade the effectiveness in the future due to data-shift. Fuzz4All can mitigate this using the autoprompting step where more up-to-date documentation/example code allows the model to also generate up-to-date fuzzing inputs. One additional threat comes from the use of the distillation LLM to generate the initial inputs, where the LLM may "hallucinate", i.e., produce made-up or inaccurate information [31]. This limitation is common to most pipelines that use LLMs, and we hope to address it in our future work.

7 CONCLUSION
We present Fuzz4All, a universal fuzzer leveraging LLMs to support both general and targeted fuzzing of arbitrary SUTs that take in a multitude of programming languages. Fuzz4All uses a novel autoprompting stage to produce input prompts that concisely summarize the user-provided inputs. In its fuzzing loop, Fuzz4All iteratively updates the initial input prompt with both code examples and generation strategies aimed at producing diverse fuzzing inputs. Evaluation results on nine different SUTs across six different languages demonstrate that Fuzz4All is able to significantly improve coverage compared to state-of-the-art tools. Furthermore, Fuzz4All is able to detect 76 bugs with 47 already confirmed by developers as previously unknown.

REFERENCES
[1] 2021. Qiskit/Qiskit. https://github.com/Qiskit/qiskit.
[2] 2023. std::expected. https://en.cppreference.com/w/cpp/utility/expected.
[3] Cornelius Aschermann, Tommaso Frassetto, Thorsten Holz, Patrick Jauernig, Ahmad-Reza Sadeghi, and Daniel Teuchert. 2019. NAUTILUS: Fishing for Deep Bugs with Grammars. In NDSS.
[4] Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity.
[13] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
[14] Yongheng Chen, Rui Zhong, Hong Hu, Hangfan Zhang, Yupeng Yang, Dinghao Wu, and Wenke Lee. 2021. One engine to fuzz'em all: Generic language processor testing with semantic validation. In 2021 IEEE Symposium on Security and Privacy (SP). IEEE, 642–658.
[15] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. PaLM: Scaling Language Modeling with Pathways. arXiv:2204.02311 [cs.CL]
[16] Andrew W. Cross, Lev S. Bishop, John A. Smolin, and Jay M. Gambetta. 2017. Open Quantum Assembly Language. arXiv:1707.03429 [quant-ph] (July 2017).
[17] Chris Cummins, Pavlos Petoumenos, Alastair Murray, and Hugh Leather. 2018. Compiler fuzzing through deep learning. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis. 95–105.
[18] Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. 2023. Large Language Models are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via Large Language Models. In ISSTA 2023. 423–435.
[19] Yinlin Deng, Chunqiu Steven Xia, Chenyuan Yang, Shizhuo Dylan Zhang, Shujing Yang, and Lingming Zhang. 2023. Large language models are edge-case fuzzers: Testing deep learning libraries via fuzzgpt. arXiv preprint arXiv:2304.02014 (2023).
[20] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[21] Karine Even-Mendoza, Cristian Cadar, and Alastair F Donaldson. 2022. CsmithEdge: more effective compiler testing by handling undefined behaviour less conservatively. Empirical Software Engineering 27, 6 (2022), 129.
[22] Karine Even-Mendoza, Arindam Sharma, Alastair F. Donaldson, and Cristian Cadar. 2023. GrayC: Greybox Fuzzing of Compilers and Analysers for C (ISSTA 2023). Association for Computing Machinery, New York, NY, USA, 1219–1231. https://doi.org/10.1145/3597926.3598130
[23] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. arXiv:2002.08155.
arXiv preprint arXiv:2302.04023 (2023). [24] Mark Fingerhuth, Tomáš Babej, and Peter Wittek. 2018. Open Source Soft- [5] Patrick Bareiß, Beatriz Souza, Marcelo d’Amorim, and Michael Pradel. ware in Quantum Computing. PLOS ONE 13, 12 (Dec. 2018), e0208561. 2022. Code Generation Tools (Almost) for Free? A Study of Few-Shot, https://doi.org/10.1371/journal.pone.0208561 Pre-Trained Language Models on Code. CoRR abs/2206.01335 (2022). [25] Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, https://doi.org/10.48550/arXiv.2206.01335 arXiv:2206.01335 Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis. 2022. Incoder: A [6] Marcel Böhme, Cristian Cadar, and Abhik Roychoudhury. 2020. Fuzzing: generative model for code infilling and synthesis. arXiv preprint arXiv:2204.05999 Challenges and reflections. IEEE Software 38, 3 (2020), 79–86. (2022). [7] Marcel Böhme, László Szekeres, and Jonathan Metzman. 2022. On the reliability [26] go-fuzz 2023. go-fuzz: randomized testing for Go. https://github.com/dvyukov/go- of coverage-based fuzzer benchmarking. In ICSE 2022. 1621–1633. fuzz. [8] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Pra- [27] Patrice Godefroid, Hila Peleg, and Rishabh Singh. 2017. Learn&fuzz: Machine fulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, learning for input fuzzing. In ASE 2017. IEEE, 50–59. Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon [28] gpt4endpoint 2023. Models - GPT-4. https://platform.openai.com/docs/models/ Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher gpt- 4. Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack [29] Alex Groce, Rijnard van Tonder, Goutamkumar Tulajappa Kalburgi, and Claire Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Le Goues. 2022. Making no-fuss compiler fuzzing effective. In Proceedings of the Dario Amodei. 2020. 
Language Models are Few-Shot Learners. arXiv:2005.14165. 31st ACM SIGPLAN International Conference on Compiler Construction. 194–204. [9] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric [30] Sumit Gulwani, Oleksandr Polozov, Rishabh Singh, et al. 2017. Program synthesis. Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Foundations and Trends® in Programming Languages 4, 1-2 (2017), 1–119. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv [31] Zhijiang Guo, Michael Schlichtkrull, and Andreas Vlachos. 2022. A survey on preprint arXiv:2303.12712 (2023). automated fact-checking. Transactions of the Association for Computational [10] Alexander Bulekov, Bandan Das, Stefan Hajnoczi, and Manuel Egele. 2023. No Linguistics 10 (2022), 178–206. Grammar, No Problem: Towards Fuzzing the Linux Kernel without System-Call [32] Christian Holler, Kim Herzig, and Andreas Zeller. 2012. Fuzzing with code Descriptions. In Network and Distributed System Security (NDSS) Symposium 2023. fragments. In 21st USENIX Security Symposium (USENIX Security 12). 445–458. [11] Stefanos Chaliasos, Thodoris Sotiropoulos, Diomidis Spinellis, Arthur Gervais, [33] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The Curious Benjamin Livshits, and Dimitris Mitropoulos. 2022. Finding typing compiler bugs. Case of Neural Text Degeneration. arXiv:1904.09751. In Proceedings of the 43rd ACM SIGPLAN International Conference on Programming [34] Bo Jiang, Xiaoyan Wang, Wing Kwong Chan, TH Tse, Na Li, Yongfeng Yin, and Language Design and Implementation. 183–198. Zhenyu Zhang. 2020. Cudasmith: A fuzzer for CUDA compilers. In 2020 IEEE [12] Junjie Chen, Jibesh Patra, Michael Pradel, Yingfei Xiong, Hongyu Zhang, Dan 44th Annual Computers, Software, and Applications Conference (COMPSAC). IEEE, Hao, and Lu Zhang. 2020. A survey of compiler testing. ACM Computing Surveys 861–871. (CSUR) 53, 1 (2020), 1–36. 
[35] jsfunfuzz 2017. Introducing jsfunfuzz. https://www.squarefree.com/2007/08/02/introducing-jsfunfuzz/.
[36] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).
[37] George Klees, Andrew Ruef, Benji Cooper, Shiyi Wei, and Michael Hicks. 2018. Evaluating Fuzz Testing. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (CCS ’18). Association for Computing Machinery, New York, NY, USA, 2123–2138. https://doi.org/10.1145/3243734.3243804
[38] Suyoung Lee, HyungSeok Han, Sang Kil Cha, and Sooel Son. 2020. Montage: A Neural Network Language Model-Guided JavaScript Engine Fuzzer. In 29th USENIX Security Symposium (USENIX Security 20). 2613–2630.
[39] Caroline Lemieux, Jeevana Priya Inala, Shuvendu K Lahiri, and Siddhartha Sen. 2023. CODAMOSA: Escaping Coverage Plateaus in Test Generation with Pre-trained Large Language Models. In ICSE 2023.
[40] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019).
[41] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023).
[42] Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190 (2021).
[43] libFuzzer 2023. libFuzzer – a library for coverage-guided fuzz testing. https://llvm.org/docs/LibFuzzer.html.
[44] Christopher Lidbury, Andrei Lascu, Nathan Chong, and Alastair F Donaldson. 2015. Many-core compiler fuzzing. ACM SIGPLAN Notices 50, 6 (2015), 65–76.
[45] Jiawei Liu, Jinkun Lin, Fabian Ruffy, Cheng Tan, Jinyang Li, Aurojit Panda, and Lingming Zhang. 2023. NNSmith: Generating diverse and valid test cases for deep learning compilers. In ASPLOS 2023, Volume 2. 530–543.
[46] Jiawei Liu, Yuxiang Wei, Sen Yang, Yinlin Deng, and Lingming Zhang. 2022. Coverage-guided tensor compiler fuzzing with joint IR-pass mutation. Proceedings of the ACM on Programming Languages 6, OOPSLA1 (2022), 1–26.
[47] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. CoRR abs/2107.13586 (2021). arXiv:2107.13586 https://arxiv.org/abs/2107.13586
[48] Xiao Liu, Xiaoting Li, Rupesh Prajapati, and Dinghao Wu. 2019. DeepFuzz: Automatic generation of syntax valid C programs for fuzz testing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 1044–1051.
[49] Vsevolod Livinskii, Dmitry Babokin, and John Regehr. 2020. Random testing for C and C++ compilers with YARPGen. Proceedings of the ACM on Programming Languages 4, OOPSLA (2020), 1–25.
[50] M. Zalewski 2016. American Fuzzy Lop - Whitepaper. https://lcamtuf.coredump.cx/afl/technical_details.txt.
[51] Haoyang Ma. 2023. A Survey of Modern Compiler Fuzzing. arXiv preprint arXiv:2306.06884 (2023).
[52] Henry B Mann and Donald R Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics (1947), 50–60.
[53] Pengyu Nie, Rahul Banerjee, Junyi Jessy Li, Raymond J. Mooney, and Milos Gligoric. 2023. Learning Deep Semantics for Test Completion. In 45th International Conference on Software Engineering.
[54] OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
[55] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
[56] Matteo Paltenghi and Michael Pradel. 2023. MorphQ: Metamorphic Testing of the Qiskit Quantum Computing Platform. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE Computer Society, 2413–2424. https://doi.org/10.1109/ICSE48619.2023.00202
[57] Jiwon Park, Dominik Winterer, Chengyu Zhang, and Zhendong Su. 2021. Generative type-aware mutation for testing SMT solvers. Proceedings of the ACM on Programming Languages 5, OOPSLA (2021), 1–19.
[58] Jibesh Patra and Michael Pradel. 2016. Learning to fuzz: Application-independent fuzz testing with probabilistic, generative models of input data. (2016).
[59] PyTorch 2023. PyTorch. http://pytorch.org.
[60] Guanghui Qin and Jason Eisner. 2021. Learning How to Ask: Querying LMs with Mixtures of Soft Prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).
[61] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018).
[62] Timo Schick and Hinrich Schütze. 2020. Exploiting cloze questions for few shot text classification and natural language inference. arXiv preprint arXiv:2001.07676 (2020).
[63] John Schulman, Barret Zoph, Jacob Hilton, Christina Kim, Jacob Menick, Jiayi Weng, Juan Felipe Ceron Uribe, Liam Fedus, Luke Metz, Michael Pokorny, Rapha Gontijo Lopes, Shengjia Zhao, Arun Vijayvergiya, Eric Sigler, Adam Perelman, Chelsea Voss, Mike Heaton, Joel Parish, Dave Cummings, Rajeev Nayak, Valerie Balcom, David Schnurr, Tomer Kaftan, Chris Hallacy, Nicholas Turley, Noah Deutsch, Vik Goel, Jonathan Ward, Aris Konstantinidis, Wojciech Zaremba, Long Ouyang, Leonard Bogdonoff, Joshua Gross, David Medina, Sarah Yoo, Teddy Lee, Ryan Lowe, Dan Mossing, Joost Huizinga, Roger Jiang, Carroll Wainwright, Diogo Almeida, Steph Lin, Marvin Zhang, Kai Xiao, Katarina Slama, Steven Bills, Alex Gray, Jan Leike, Jakub Pachocki, Phil Tillet, Shantanu Jain, Greg Brockman, and Nick Ryder. 2022. ChatGPT: Optimizing Language Models for Dialogue. (2022). https://openai.com/blog/chatgpt/.
[64] Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2023. Adaptive Test Generation Using a Large Language Model. arXiv:2302.06527 [cs.SE]
[65] Kensen Shi, David Bieber, and Rishabh Singh. 2022. TF-Coder: Program synthesis for tensor manipulations. ACM Transactions on Programming Languages and Systems (TOPLAS) 44, 2 (2022), 1–36.
[66] Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980 (2020).
[67] Michael Sutton, Adam Greene, and Pedram Amini. 2007. Fuzzing: Brute Force Vulnerability Discovery. Addison-Wesley Professional.
[68] syzkaller 2023. syzkaller - kernel fuzzer. https://github.com/google/syzkaller.
[69] Derek Tam, Rakesh R Menon, Mohit Bansal, Shashank Srivastava, and Colin Raffel. 2021. Improving and simplifying pattern exploiting training. arXiv preprint arXiv:2103.11955 (2021).
[70] TensorFlow 2023. TensorFlow. https://www.tensorflow.org.
[71] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[72] Vasudev Vikram, Caroline Lemieux, and Rohan Padhye. 2023. Can Large Language Models Write Good Property-Based Tests? arXiv preprint arXiv:2307.04346 (2023).
[73] Chaozheng Wang, Yuanhang Yang, Cuiyun Gao, Yun Peng, Hongyu Zhang, and Michael R Lyu. 2022. No more fine-tuning? An experimental evaluation of prompt tuning in code intelligence. In ESEC/FSE 2022. 382–394.
[74] Anjiang Wei, Yinlin Deng, Chenyuan Yang, and Lingming Zhang. 2022. Free lunch for testing: Fuzzing deep-learning libraries from open source. In ICSE 2022. 995–1007.
[75] Dominik Winterer, Chengyu Zhang, and Zhendong Su. 2020. On the unusual effectiveness of type-aware operator mutations for testing SMT solvers. Proc. ACM Program. Lang. 4, OOPSLA (2020), 193:1–193:25.
[76] Dominik Winterer, Chengyu Zhang, and Zhendong Su. 2020. Validating SMT Solvers via Semantic Fusion. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation. 718–730.
[77] Chunqiu Steven Xia and Lingming Zhang. 2023. Keep the Conversation Going: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT. arXiv preprint arXiv:2304.00385 (2023).
[78] Frank F. Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. A Systematic Evaluation of Large Language Models of Code (MAPS 2022). Association for Computing Machinery, New York, NY, USA, 1–10.
[79] Xuejun Yang, Yang Chen, Eric Eide, and John Regehr. 2011. Finding and understanding bugs in C compilers. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation. 283–294.
[80] Zhiqiang Yuan, Yiling Lou, Mingwei Liu, Shiji Ding, Kaixin Wang, Yixuan Chen, and Xin Peng. 2023. No More Manual Tests? Evaluating and Improving ChatGPT for Unit Test Generation. arXiv:2305.04207 [cs.SE]
[81] Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021.
[82] Andreas Zeller, Rahul Gopinath, Marcel Böhme, Gordon Fraser, and Christian Holler. 2019. The fuzzing book.
[83] Hui Zhao, Zhihui Li, Hansheng Wei, Jianqi Shi, and Yanhong Huang. 2019. SeqFuzzer: An Industrial Protocol Fuzzing Framework from a Deep Learning Perspective. In 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST). 59–67. https://doi.org/10.1109/ICST.2019.00016
[84] Yingquan Zhao, Zan Wang, Junjie Chen, Mengdi Liu, Mingyuan Wu, Yuqun Zhang, and Lingming Zhang. 2022. History-Driven Test Program Synthesis for JVM Testing. In Proceedings of the 44th International Conference on Software Engineering (Pittsburgh, Pennsylvania) (ICSE ’22). 1133–1144.
[85] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2022. Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910 (2022).
[86] Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-Tuning Language Models from Human Preferences. arXiv:1909.08593.

Computing Research Repository, arXiv (Cornell University)
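Section 6 states that each 24-hour fuzzing campaign is repeated five times and that results are checked for statistical significance. A minimal sketch of such a check is given below, assuming a two-sided Mann-Whitney U-test [52] with an exact permutation p-value. The function names and the coverage numbers are hypothetical and serve only as an illustration; they are not taken from the evaluation.

```python
from itertools import combinations


def mann_whitney_u(xs, ys):
    """U statistic of xs versus ys: pairs with x > y count 1, ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)


def exact_two_sided_p(xs, ys):
    """Exact p-value: share of all relabelings at least as extreme as observed."""
    pooled = list(xs) + list(ys)
    n, m = len(xs), len(ys)
    mean_u = n * m / 2.0  # expected U under the null hypothesis
    observed = abs(mann_whitney_u(xs, ys) - mean_u)
    total = extreme = 0
    for idx in combinations(range(n + m), n):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(n + m) if i not in chosen]
        total += 1
        if abs(mann_whitney_u(a, b) - mean_u) >= observed:
            extreme += 1
    return extreme / total


# Hypothetical final coverage of five runs per fuzzer (illustrative values only).
fuzzer_a = [198_000, 201_500, 199_200, 203_100, 200_400]
fuzzer_b = [167_000, 171_300, 165_800, 170_100, 168_900]

u = mann_whitney_u(fuzzer_a, fuzzer_b)
p = exact_two_sided_p(fuzzer_a, fuzzer_b)
print(f"U = {u}, exact two-sided p = {p:.4f}")
```

With five runs per group there are only 252 possible relabelings, so the smallest two-sided p-value this test can produce is 2/252, roughly 0.008. That value is reached exactly when every run of one fuzzer outperforms every run of the other, as in the hypothetical data above.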

Loading next page...

References (86)

Eric Horvitz (2009)
Association for Computing Machinery
Pengyu Nie, Rahul Banerjee, Junyi Li, R. Mooney, Miloš Gligorić (2023)
Learning Deep Semantics for Test Completion
2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE)
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, B. Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, M. Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, S. Ghemawat, Sunipa Dev, H. Michalewski, Xavier García, Vedant Misra, Kevin Robinson, L. Fedus, Denny Zhou, Daphne Ippolito, D. Luan, Hyeontaek Lim, Barret Zoph, A. Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew Dai, T. Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Díaz, Orhan Firat, Michele Catasta, Jason Wei, K. Meier-Hellstern, D. Eck, J. Dean, Slav Petrov, Noah Fiedel (2022)
PaLM: Scaling Language Modeling with Pathways
J. Mach. Learn. Res., 24
V. Livinskii, Dmitry Babokin, J. Regehr (2020)
Random testing for C and C++ compilers with YARPGen
Proceedings of the ACM on Programming Languages, 4
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova (2019)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet Do, Yan Xu, Pascale Fung (2023)
A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity
ArXiv, abs/2302.04023
M. Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdel-rahman Mohamed, Omer Levy, Veselin Stoyanov, Luke Zettlemoyer (2019)
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
Max Schäfer, Sarah Nadi, A. Eghbali, Frank Tip (2023)
Adaptive Test Generation Using a Large Language Model
ArXiv, abs/2302.06527
Caroline Lemieux, J. Inala, Shuvendu Lahiri, S. Sen (2023)
CodaMosa: Escaping Coverage Plateaus in Test Generation with Pre-trained Large Language Models
2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE)
Karine Even-Mendoza, Cristian Cadar, A. Donaldson (2022)
CsmithEdge: more effective compiler testing by handling undefined behaviour less conservatively
Empirical Software Engineering, 27
Yongheng Chen, Rui Zhong, Hong Hu, Hangfan Zhang, Yupeng Yang, Dinghao Wu, Wenke Lee (2021)
One Engine to Fuzz ’em All: Generic Language Processor Testing with Semantic Validation
2021 IEEE Symposium on Security and Privacy (SP)
Xuejun Yang, Yang Chen, E. Eide, J. Regehr (2011)
Finding and understanding bugs in C compilers
Yinlin Deng, Chun Xia, Chenyuan Yang, Shizhuo Zhang, Shujing Yang, Lingming Zhang (2023)
Large Language Models are Edge-Case Fuzzers: Testing Deep Learning Libraries via FuzzGPT
ArXiv, abs/2304.02014
Yue Wang, Weishi Wang, Shafiq Joty, S. Hoi (2021)
CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
ArXiv, abs/2109.00859
Stefanos Chaliasos, Thodoris Sotiropoulos, D. Spinellis, Arthur Gervais, B. Livshits, Dimitris Mitropoulos (2022)
Finding typing compiler bugs
Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation
Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, M. Lewis (2022)
InCoder: A Generative Model for Code Infilling and Synthesis
ArXiv, abs/2204.05999
Bo Jiang, Xiaoyan Wang, W. Chan, T. Tse, Na Li, Yongfeng Yin, Zhenyu Zhang (2020)
CUDAsmith: A Fuzzer for CUDA Compilers
2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC)
Alec Radford, Karthik Narasimhan (2018)
Improving Language Understanding by Generative Pre-Training
(2019)
Fine-TuningLanguage ModelsfromHumanPreferences
Yongchao Zhou, Andrei Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, Jimmy Ba (2022)
Large Language Models Are Human-Level Prompt Engineers
ArXiv, abs/2211.01910
libFuzzer 2023. libFuzzer – a library for coverage-guided fuzz testing
J. Kaplan, Sam McCandlish, T. Henighan, Tom Brown, Benjamin Chess, Rewon Child, S. Gray, Alec Radford, Jeff Wu, Dario Amodei (2020)
Scaling Laws for Neural Language Models
ArXiv, abs/2001.08361
Dominik Winterer, Chengyu Zhang, Z. Su (2020)
Validating SMT solvers via semantic fusion
Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation
Jiawei Liu, Yuxiang Wei, Sen Yang, Yinlin Deng, Lingming Zhang (2022)
Coverage-guided tensor compiler fuzzing with joint IR-pass mutation
Proceedings of the ACM on Programming Languages, 6
TensorFlow2023
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harrison Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, S. Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, F. Such, D. Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Guss, Alex Nichol, Igor Babuschkin, S. Balaji, Shantanu Jain, A. Carr, J. Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, M. Knight, Miles Brundage, Mira Murati, Katie Mayer, P. Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, Wojciech Zaremba (2021)
Evaluating Large Language Models Trained on Code
ArXiv, abs/2107.03374
M. Sutton, Adam Greene, P. Amini (2007)
Fuzzing: Brute Force Vulnerability Discovery
Chun Xia, Lingming Zhang (2023)
Keep the Conversation Going: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT
ArXiv, abs/2304.00385
Christopher Lidbury, Andrei Lascu, Nathan Chong, A. Donaldson (2015)
Many-core compiler fuzzing
Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation
Suyoung Lee, HyungSeok Han, S. Cha, Sooel Son (2020)
Montage: A Neural Network Language Model-Guided JavaScript Engine Fuzzer
Christian Holler, Kim Herzig, A. Zeller (2012)
Fuzzing with Code Fragments
Karine Even-Mendoza, Arindam Sharma, A. Donaldson, Cristian Cadar (2023)
GrayC: Greybox Fuzzing of Compilers and Analysers for C
Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis
(2017)
Programsynthesis
Alex Groce, Rijnard Tonder, G. Kalburgi, Claire Goues (2022)
Making no-fuss compiler fuzzing effective
Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, Ming Zhou (2020)
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
ArXiv, abs/2002.08155
Xiao Liu, Xiaoting Li, Rupesh Prajapati, Dinghao Wu (2019)
DeepFuzz: Automatic Generation of Syntax Valid C Programs for Fuzz Testing
M. Fingerhuth, Tomás Babej, P. Wittek (2018)
Open source software in quantum computing
PLoS ONE, 13
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, J. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. Henighan, Rewon Child, A. Ramesh, Daniel Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, S. Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei (2020)
Language Models are Few-Shot Learners
ArXiv, abs/2005.14165
Dominik Winterer, Chengyu Zhang, Z. Su (2020)
On the unusual effectiveness of type-aware operator mutations for testing SMT solvers
Proceedings of the ACM on Programming Languages, 4
(2017)
Introducing jsfunfuzz
(2016)
American Fuzzy Lop - Whitepaper. https: //lcamtuf.coredump.cx/afl/technical_details.txt
Zhijiang Guo, M. Schlichtkrull, Andreas Vlachos (2021)
A Survey on Automated Fact-Checking
Transactions of the Association for Computational Linguistics, 10
(2022)
ChatGPT: Optimizing Language Models for Dialogue
Alexander Bulekov, Bandan Das, Stefan Hajnoczi, Manuel Egele (2023)
No Grammar, No Problem: Towards Fuzzing the Linux Kernel without System-Call Descriptions
Proceedings 2023 Network and Distributed System Security Symposium
Haoyang Ma (2023)
A Survey of Modern Compiler Fuzzing
ArXiv, abs/2306.06884
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, John Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Y. Lee, Yuan-Fang Li, Scott Lundberg, Harsha Nori, H. Palangi, Marco Ribeiro, Yi Zhang (2023)
Sparks of Artificial General Intelligence: Early experiments with GPT-4
ArXiv, abs/2303.12712
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, Yejin Choi (2019)
The Curious Case of Neural Text Degeneration
ArXiv, abs/1904.09751
Marcel Böhme, László Szekeres, Jonathan Metzman (2022)
On the Reliability of Coverage-Based Fuzzer Benchmarking
2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE)
Guanghui Qin, J. Eisner (2021)
Learning How to Ask: Querying LMs with Mixtures of Soft Prompts
ArXiv, abs/2104.06599
(2020)
Autoprompt:Elicitingknowledgefromlanguagemodelswithautomaticallygeneratedprompts
Kensen Shi, David Bieber, Rishabh Singh (2020)
TF-Coder: Program Synthesis for Tensor Manipulations
ACM Transactions on Programming Languages and Systems (TOPLAS), 44
Hui Zhao, Zhihui Li, Hansheng Wei, Jianqi Shi, Yanhong Huang (2019)
SeqFuzzer: An Industrial Protocol Fuzzing Framework from a Deep Learning Perspective
2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST)
2023. GrayC:
Patrick Bareiss, Beatriz Souza, Marcelo d’Amorim, Michael Pradel (2022)
Code Generation Tools (Almost) for Free? A Study of Few-Shot, Pre-Trained Language Models on Code
ArXiv, abs/2206.01335
Andrew Cross, L. Bishop, J. Smolin, J. Gambetta (2017)
Open Quantum Assembly Language
arXiv: Quantum Physics
Xiang Li, Percy Liang (2021)
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), abs/2101.00190
Frank Xu, Uri Alon, Graham Neubig, V. Hellendoorn (2022)
A systematic evaluation of large language models of code
Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming
go-fuzz2023
Yingquan Zhao, Zan Wang, Junjie Chen, Mengdi Liu, Mingyuan Wu, Yuqun Zhang, Lingming Zhang (2022)
History-Driven Test Program Synthesis for JVM Testing
2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE)
H. Mann, D. Whitney (1947)
On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other
Annals of Mathematical Statistics, 18
Yinlin Deng, Chun Xia, Haoran Peng, Chenyuan Yang, Lingming Zhang (2022)
Large Language Models Are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via Large Language Models
Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis
(2019)
The fuzzing book
OpenAI
Anjiang Wei, Y. Deng, Chenyuan Yang, Lingming Zhang (2022)
Free Lunch for Testing: Fuzzing Deep-Learning Libraries from Open Source
2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE)
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, Graham Neubig (2021)
Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing
ACM Computing Surveys, 55
Conference’17,July2017,Washington,DC,USA
Vasudev Vikram, Caroline Lemieux, Rohan Padhye (2023)
Can Large Language Models Write Good Property-Based Tests?
ArXiv, abs/2307.04346
Chris Cummins, Pavlos Petoumenos, A. Murray, Hugh Leather (2018)
Compiler fuzzing through deep learning
Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis
Matteo Paltenghi, Michael Pradel (2022)
MorphQ: Metamorphic Testing of the Qiskit Quantum Computing Platform
2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE)
(2017)
Program synthesis. Foundations and Trends® in Programming Languages
Marcel Boehme, Cristian Cadar, Abhik Roychoudhury (2020)
Fuzzing: Challenges and Reflections
IEEE Software, 38
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, Illia Polosukhin (2017)
Attention is All you Need
Zhiqiang Yuan, Yiling Lou, Mingwei Liu, Shiji Ding, Kaixin Wang, Yixuan Chen, Xin Peng (2023)
No More Manual Tests? Evaluating and Improving ChatGPT for Unit Test Generation
ArXiv, abs/2305.04207
Jiawei Liu, Jinkun Lin, Fabian Ruffy, Cheng Tan, Jinyang Li, Aurojit Panda, Lingming Zhang (2022)
NNSmith: Generating Diverse and Valid Test Cases for Deep Learning Compilers
Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2
Chaozheng Wang, Yuanhang Yang, Cuiyun Gao, Yun Peng, Hongyu Zhang, Michael Lyu (2022)
No more fine-tuning? an experimental evaluation of prompt tuning in code intelligence
Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
Derek Tam, Rakesh Menon, Mohit Bansal, Shashank Srivastava, Colin Raffel (2021)
Improving and Simplifying Pattern Exploiting Training
ArXiv, abs/2103.11955
Timo Schick, Hinrich Schütze (2020)
Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference
Jibesh Patra, Michael Pradel (2016)
Learning to Fuzz: Application-Independent Fuzz Testing with Probabilistic, Generative Models of Input Data
Patrice Godefroid, Hila Peleg, Rishabh Singh (2017)
Learn&Fuzz: Machine learning for input fuzzing
2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE)
Jiwon Park, Dominik Winterer, Chengyu Zhang, Z. Su (2021)
Generative type-aware mutation for testing SMT solvers
Proceedings of the ACM on Programming Languages, 5
Cornelius Aschermann, Tommaso Frassetto, Thorsten Holz, Patrick Jauernig, A. Sadeghi, D. Teuchert (2019)
NAUTILUS: Fishing for Deep Bugs with Grammars
Proceedings 2019 Network and Distributed System Security Symposium
Junjie Chen, Jibesh Patra, Michael Pradel, Y. Xiong, Hongyu Zhang, Dan Hao, Lu Zhang (2020)
A Survey of Compiler Testing
ACM Computing Surveys (CSUR), 53
George Klees, Andrew Ruef, Benji Cooper, Shiyi Wei, M. Hicks (2018)
Evaluating Fuzz Testing
Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, J. Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, P. Welinder, P. Christiano, J. Leike, Ryan Lowe (2022)
Training language models to follow instructions with human feedback
ArXiv, abs/2203.02155
Raymond Li, Loubna Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, J. Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, J. Stillerman, S. Patel, Dmitry Abulkhanov, M. Zocca, Manan Dey, Zhihan Zhang, N. Fahmy, Urvashi Bhattacharyya, W. Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, M. Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jana Ebert, Tri Dao, Mayank Mishra, A. Gu, Jennifer Robinson, Carolyn Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro Werra, Harm Vries (2023)
StarCoder: may the source be with you!
ArXiv, abs/2305.06161

eISSN: ARCH-3344
DOI: 10.48550/arxiv.2308.04748

Abstract

Chunqiu Steven Xia, University of Illinois Urbana-Champaign, USA
Matteo Paltenghi, University of Stuttgart, Germany
Jia Le Tian, University of Illinois Urbana-Champaign, USA
Michael Pradel, University of Stuttgart, Germany
Lingming Zhang, University of Illinois Urbana-Champaign, USA

arXiv:2308.04748v1 [cs.SE] 9 Aug 2023
Conference’17, July 2017, Washington, DC, USA

ABSTRACT

Fuzzing has achieved tremendous success in discovering bugs and vulnerabilities in various software systems. Systems under test (SUTs) that take in programming or formal language as inputs, e.g., compilers, runtime engines, constraint solvers, and software libraries with accessible APIs, are especially important as they are fundamental building blocks of software development. However, existing fuzzers for such systems often target a specific language, and thus cannot be easily applied to other languages or even other versions of the same language. Moreover, the inputs generated by existing fuzzers are often limited to specific features of the input language, and thus can hardly reveal bugs related to other or new features. This paper presents Fuzz4All, the first fuzzer that is universal in the sense that it can target many different input languages and many different features of these languages. The key idea behind Fuzz4All is to leverage large language models (LLMs) as an input generation and mutation engine, which enables the approach to produce diverse and realistic inputs for any practically relevant language. To realize this potential, we present a novel autoprompting technique, which creates LLM prompts that are well-suited for fuzzing, and a novel LLM-powered fuzzing loop, which iteratively updates the prompt to create new fuzzing inputs. We evaluate Fuzz4All on nine systems under test that take in six different languages (C, C++, Go, SMT2, Java and Python) as inputs. The evaluation shows, across all six languages, that universal fuzzing achieves higher coverage than existing, language-specific fuzzers. Furthermore, Fuzz4All has identified 76 bugs in widely used systems, such as GCC, Clang, Z3, CVC5, OpenJDK, and the Qiskit quantum computing platform, with 47 bugs already confirmed by developers as previously unknown.

1 INTRODUCTION

Fuzz testing [67, 82], also known as fuzzing, is an automated testing approach for generating inputs designed to expose unexpected behaviors, e.g., crashes, of a system under test (SUT). Researchers and practitioners have successfully built practical fuzzing tools, which have shown great success in finding numerous bugs and vulnerabilities in real-world systems [6]. A particularly important family of SUTs are systems that take in programming or formal language inputs, e.g., compilers, runtime engines, constraint solvers, and literally any libraries with accessible APIs. Numerous fuzzers have been proposed for such systems since they are the fundamental building blocks for software development [12], e.g., finding bugs in compilers and runtime engines is crucial because they can affect all corresponding downstream applications.

Traditional fuzzers can be categorized as generation-based [35, 49, 79] or mutation-based [22, 32, 67]. Generation-based fuzzers aim to directly synthesize complete code snippets, e.g., using a pre-defined grammar for the target language. Instead of synthesizing from scratch, mutation-based fuzzers apply mutation operators or transformation rules to a set of high quality fuzzing seeds. Unfortunately, both traditional fuzzing approaches face the following limitations and challenges:

C1: Tight coupling with target system and language. Traditional fuzzers are often designed to target a specific language or a particular SUT. However, designing and implementing a fuzzer is extremely time-consuming. For example, Csmith [79], a fuzzer for C/C++ compilers, has more than 80K lines of code, while Syzkaller [68], a fuzzer for Linux system calls, contains tens of thousands of handcrafted rules [10] to generate and modify system calls. Because each target language is different, it is often non-trivial to reuse the effort of implementing a fuzzer from one input language to another. Furthermore, fuzzing strategies that work well for one SUT may not work at all for another one.

C2: Lack of support for evolution. Real-world systems are constantly evolving, e.g., by adding new features to the input language. Traditional fuzzers designed for a specific version of a language or SUT may lose their effectiveness on a new version and cannot be easily used to test newly implemented features. For example, Csmith supports only a limited set of features up to C++11, while the C++ language has evolved significantly since then. In fact, recent work [21] shows that over a six-month fuzzing period, Csmith was not able to uncover any new bugs in the latest releases of popular GCC and Clang compilers, showing that new versions of compilers are becoming immune to existing fuzzers.

C3: Restricted generation ability. Even within the scope of a specific target language, both generation-based and mutation-based fuzzing often are unable to cover a large part of the input space. Generation-based fuzzers rely heavily on an input grammar to synthesize valid code, and additionally are equipped with semantic rules that ensure the validity of the synthesized code. To generate a large number of valid fuzzing inputs or to side-step difficult-to-model language features, generation-based fuzzers often use a subset of the full language grammar, which limits them to test only a subset of all language features. Similarly, mutation-based fuzzers are limited by their mutation operators and require high quality seeds that can be difficult to obtain.

Our Work. We present Fuzz4All, the first fuzzer that is universal in the sense that it can target many different input languages and many different features of these languages. Our approach fundamentally differs from existing general-purpose fuzzers, e.g., AFL [50] and libFuzzer [43], which use extremely simple mutations, are unaware of the target language, and therefore struggle to produce meaningful programming language fuzzing inputs. Instead, our key idea is to leverage a large language model (LLM) as an input generation and mutation engine. Because LLMs are pre-trained on large amounts of examples in various programming languages and other formal languages, they come with an implicit understanding of the syntax and semantics of these languages. Fuzz4All leverages this ability by using an LLM as a universal input generation and mutation engine.

The inputs to Fuzz4All are user-provided documents describing the SUT, and optionally, specific features of the SUT to focus on, e.g., in the form of documentation, example code, or formal specifications. However, these user inputs may be too verbose to directly use as a prompt for the LLM. Instead of requiring the user to manually engineer a prompt [47], which is time-consuming, we present an autoprompting step that automatically distills all user-provided inputs into a concise and effective prompt for fuzzing. This prompt is the initial input to an LLM that generates fuzzing inputs. Since continuously sampling with the same prompt would lead to many similar fuzzing inputs, we present an LLM-powered fuzzing loop, which iteratively updates the prompt to generate a diverse set of fuzzing inputs. To this end, Fuzz4All combines fuzzing inputs generated in previous iterations with natural language instructions, e.g., asking to mutate these inputs. The LLM-generated fuzzing inputs are then passed to the SUT, whose behavior we validate against a user-provided test oracle, such as checking for system crashes.

Fuzz4All addresses the previously discussed limitations and challenges of traditional fuzzers. Instead of meticulously designing a single-purpose fuzzer for a specific SUT (C1), Fuzz4All, by using an LLM as the generation engine, can be applied to a wide range of SUTs and input languages. Compared to existing fuzzers that target a specific version of the SUT or input language (C2), Fuzz4All can easily evolve with the target. For example, to fuzz-test a newly implemented feature, a user can simply provide documentation or example code related to that feature. To address the restricted generation ability of traditional fuzzers (C3), Fuzz4All exploits the fact that LLMs are pre-trained on billions of code snippets, enabling them to create a wide range of examples that likely obey the syntactic and semantic constraints of the target language/SUT. Finally, Fuzz4All does not require any instrumentation of the SUT, making the approach easily applicable in practice.

We perform an extensive evaluation on six input languages (C, C++, SMT, Go, Java, and Python) and nine SUTs. For each of them, we compare our approach against state-of-the-art generation-based and mutation-based fuzzers. The results show that Fuzz4All achieves the highest code coverage across all languages, improving the previous state-of-the-art coverage by 36.8%, on average. Additionally, we demonstrate that Fuzz4All supports both general fuzzing and fuzzing targeted at specific features of the SUT, which a user decides upon by providing adequate input documents. Finally, Fuzz4All detects 76 bugs across our studied SUTs, with 47 already confirmed by developers as previously unknown.

Contributions: This paper makes the following contributions:

★ Universal fuzzing. We introduce a new dimension for fuzzing that directly leverages the multi-lingual capabilities of LLMs to fuzz-test many SUTs with a wide range of meaningful inputs.
★ Autoprompting for fuzzing. We present a novel autoprompting stage to support both general and targeted fuzzing by automatically distilling user inputs into a prompt that is effective at generating inputs to the SUT.
★ LLM-powered fuzzing loop. We present an algorithm that continuously generates new fuzzing inputs by iteratively modifying the prompt with selected examples and generation strategies.
★ Evidence of real-world effectiveness. We show across six popular languages and nine real-world SUTs (e.g., GCC, CVC5, Go, javac, and Qiskit) that our approach significantly improves coverage compared to state-of-the-art fuzzers (avg. 36.8%) and detects 76 bugs, with 47 already confirmed as previously unknown.
★ Continuous updating. We plan to continue to apply Fuzz4All on additional targets and languages. Our code, dataset, and up-to-date progress can be found at: https://fuzz4all.github.io

2 BACKGROUND & RELATED WORK

2.1 Large Language Models

Recent developments in natural language processing (NLP) have led to the wide-spread adoption of large language models (LLMs) for both natural language [8] and code tasks [78]. State-of-the-art LLMs are based on transformers [71] and can be classified into decoder-only (e.g., GPT3 [8] and StarCoder [41]), encoder-only (e.g., BERT [20] and CodeBERT [23]) and encoder-decoder (e.g., BART [40] and CodeT5 [81]) models. More recently, instruction-based LLMs (e.g., ChatGPT [63] and GPT4 [54]) and LLMs fine-tuned using reinforcement learning from human feedback (RLHF) [86] are shown to understand and follow complex instructions [4, 55, 63].

LLMs are typically either fine-tuned [61] or prompted [47] to perform specific tasks. Fine-tuning updates the model weights through further training on a task-specific dataset. However, suitable datasets may be unavailable, and as LLM sizes continue to grow [36], fine-tuning a large LLM is also increasingly expensive. Prompting, on the other hand, does not require explicitly updating the model weights, but provides the LLM with a description of the task, and optionally, a few examples of solving the task. The process of picking the input (i.e., prompt) is known as prompt engineering [47], where a user tries different input instructions until finding one that works well. Recently, researchers have proposed autoprompting [66], an automatic process that uses LLM gradients to select either soft prompts [42, 60], i.e., continuous vector embeddings, or hard prompts [62, 69], i.e., natural language text. Even more recently, researchers have substituted gradient-based methods by computing a proxy score of effectiveness [85].

This work leverages LLMs for the important problem of fuzzing. Unlike traditional autoprompting and proxy-based approaches, our autoprompting strategy directly synthesizes prompts using GPT4 and scores them according to a fuzzing-specific goal.

2.2 Fuzzing and Testing

Fuzz testing aims to generate inputs that cause unexpected behaviors of the SUT. Traditional fuzzers can be classified as generation-based [35, 49, 79] or mutation-based [22, 32, 67]. Generation-based fuzzers create complete code snippets using pre-defined grammars and built-in knowledge of the semantics of the target language. Csmith [79] and YARPGen [49] hard-code language specifications to ensure the validity of generated code snippets to test C and C++ compilers, respectively. jsfunfuzz [35] combines a language grammar with historical bug-triggering code snippets to generate new inputs to test JavaScript engines. Generation-based fuzzers have also been used to test OpenCL [44], the JVM [11], CUDA [34] and deep learning compilers [45]. Mutation-based fuzzers [67] iteratively perform transformations on seeds to generate new fuzzing inputs. In addition to basic mutations, researchers have developed complex transformations targeted at ensuring type consistency [11, 57], adding historical bug-triggering code snippets [32, 84], and coverage feedback [3, 22, 46]. To benefit from both generation and mutation, many fuzzers use a combination of both approaches [12, 51].

Different from the above fuzzers, which target specific SUTs or languages, another line of research is on general-purpose fuzzing. AFL [50] and libFuzzer [43] are general-purpose fuzzers that use genetic algorithms with a fitness function to prioritize fuzzing inputs for further mutations that achieve new coverage. These mutations are unaware of the SUT and focus on byte-level transformations. That is, when applied on SUTs that receive programming languages as input, general-purpose fuzzers are extremely unlikely to produce valid inputs. Recent work [29] has instead added regular expression-based mutation operators to match common programming statements (e.g., change + to -). The simplicity of these mutation operators limits the ability of such fuzzers to cover new code, especially in more complex languages, such as C [22, 29]. PolyGlot [14] is another language-agnostic fuzzer, which first parses the seed programs into a uniform intermediate representation using a language-specific grammar and then uses a set of mutation operators to generate new programs. While promising, PolyGlot still uses a limited set of mutations and cannot achieve the same level of coverage as fuzzers that are designed for a particular language [22].

To complement traditional fuzzing techniques and apply fuzzing to emerging domains, learning-based fuzzers have been proposed. Prior learning-based techniques mainly focus on training a neural network to generate fuzzing inputs. TreeFuzz [58] parses the training corpus into a tree structure and, through tree traversal, learns a probabilistic, generative model that synthesizes new fuzzing inputs. Deep learning models have been used to fuzz PDF parsers [27], OpenCL [17], C [48], network protocols [83], and JavaScript [38]. Very recently, researchers have also directly leveraged LLMs for fuzzing specific libraries. TitanFuzz [18] uses Codex [13] to generate seed programs and InCoder [25] to perform template-based mutation for fuzzing deep learning libraries [59, 70]. FuzzGPT [19] is another LLM-based deep learning library fuzzer, which leverages historical bug-triggering code snippets to either prompt or directly fine-tune LLMs towards generating more unusual code snippets for more effective fuzzing.

Unlike prior learning- and LLM-based fuzzers, Fuzz4All is easily applicable across many programming languages. Prior work trains language-specific models or requires language-specific parsing. Even recent LLM-based techniques [18, 19] are designed specifically for deep learning libraries with hand-crafted prompts or mutation patterns, and therefore cannot be easily extended to other SUTs. Furthermore, unlike existing techniques, which produce general fuzzing inputs in a particular language, Fuzz4All additionally supports targeted fuzzing, which can generate code snippets that focus on selected features.

In addition to fuzzing, LLMs have also been applied to the related problem of unit-test generation [5, 39, 53, 64, 72, 80]. CodeMosa [39] interleaves traditional search-based software testing with querying Codex to generate new unit-tests whenever a coverage plateau is reached. TestPilot [64] prompts Codex with method source code and example usages to generate unit-tests and to fix incorrectly generated tests. In contrast to these LLM-based test generators, which require a specific type of input (e.g., function source code) and only work for unit testing [53, 64], by using our novel autoprompting stage, Fuzz4All can take inputs in arbitrary formats for both general and targeted fuzzing. Furthermore, such unit-test generators often require manual work to check/complete the tests as even state-of-the-art LLMs [15, 63] cannot always produce reliable oracles. Instead, Fuzz4All leverages widely-used fuzzing oracles, such as crashes, and is fully automated.

3 FUZZ4ALL APPROACH

We present Fuzz4All, a universal fuzzer that leverages LLMs to support both general and targeted fuzzing of any SUT that takes in programming language input. Figure 1 provides an overview of our approach. Fuzz4All first takes in arbitrary user input that describes the fuzzing inputs to be generated, e.g., documentation of the SUT, example code snippets, or specifications. As the user input may be long, redundant, and partially irrelevant, the approach distills it into a concise but informative prompt for fuzzing. To this end, Fuzz4All performs an autoprompting step (Section 3.1) by using a large, state-of-the-art distillation LLM to sample multiple different candidate prompts (step 1 in Figure 1). Each candidate prompt is passed on to the generation LLM to generate code snippets, i.e., fuzzing inputs (step 2). Fuzz4All then selects the prompt that produces the highest quality fuzzing inputs (step 3).

Fuzz4All builds on two models, a distillation LLM that reduces the given user input and a generation LLM that creates the fuzzing inputs, to balance the trade-off between the costs and benefits different LLMs provide. Because the distillation LLM needs to understand and distill arbitrary user input, we use a high-end, large foundational model with strong natural language understanding abilities. However, directly using such a large model for input generation would be inefficient due to the high inference cost of autoregressive generation. Instead, to perform efficient fuzzing, Fuzz4All uses a smaller model as the generation LLM. While our approach is general across any pairs of distillation and generation LLMs, we implement Fuzz4All with state-of-the-art GPT4 [54] and StarCoder [41].

Using the best prompt selected via autoprompting as the initial input prompt for the generation LLM, we then move on to the fuzzing loop (Section 3.2), where Fuzz4All continuously samples the generation LLM to generate fuzzing inputs (step 4). To avoid generating many similar fuzzing inputs, Fuzz4All continuously updates the input prompt in each iteration. Specifically, the approach selects a previously generated input as an example (step 5), which demonstrates the kind of future inputs we want the model to generate. In addition to the example, Fuzz4All also appends a generation instruction to the initial prompt, which guides the model toward generating new fuzzing inputs (step 6). This process is repeated while continuously passing the generated fuzzing inputs into the SUT and checking its behavior against a user-defined oracle, such as crashes.

Figure 1: Overview of Fuzz4All. The figure has two panels, Autoprompting and Fuzzing Loop. User inputs (documentation, example code, or a specification) go to the distillation LLM, which samples candidate prompts (1); the generation LLM samples code snippets for each candidate prompt (2); the best prompt is scored and selected (3) and becomes the input prompt from which the generation LLM samples fuzzing inputs for the system under test (4); a code snippet is selected (5) and the input prompt is updated (6) using one of the generation strategies generate-new, mutate-existing, or semantic-equiv.

3.1 Autoprompting

The following presents the details of the first of two main steps of Fuzz4All, which distills the given user input via autoprompting into a prompt suitable for fuzzing. The user input may describe the SUT in general, or a particular feature of the SUT to be tested. As shown in Figure 1, user inputs may include technical documentation, example code, specifications, or even combinations of different modalities. Unlike traditional fuzzers that require inputs to follow a specific format, e.g., code snippets to use as seeds or well-formed specifications, Fuzz4All can directly understand the natural language descriptions or code examples in the user input. However, some information in the user input may be redundant or irrelevant, and hence, directly using the user inputs as a prompt for the generation LLM may be ineffective, as confirmed by our ablation study in Section 5.3. Therefore, the goal of autoprompting is to generate a distilled input prompt that enables effective LLM-based fuzzing.

Algorithm 1: Autoprompting for fuzzing
1 Function Autoprompting:
    Input : userInput, numSamples
    Output: inputPrompt
2   greedyPrompt ← M_D(userInput, APInstruction, temp=0)
3   candidatePrompts ← [greedyPrompt]
4   while |candidatePrompts| < numSamples do
5     prompt ← M_D(userInput, APInstruction, temp=1)
6     candidatePrompts ← candidatePrompts + [prompt]
7   inputPrompt ← argmax_{p ∈ candidatePrompts} Scoring(M_G(p), SUT)
8   return inputPrompt

3.1.1 Autoprompting Algorithm. Algorithm 1 details Fuzz4All’s autoprompting step. The inputs are the user input and the number of candidate prompts to generate. The final output is the input prompt selected to be used for the fuzzing campaign. As our goal is to use a distillation LLM to generate prompts that distill the information provided by the user, we give the following autoprompting instruction to the distillation LLM: “Please summarize the above information in a concise manner to describe the usage and functionality of the target”. Let $M_D$ be the distillation LLM, userInput be the user input and APInstruction be the autoprompting instruction. The generated prompt, prompt, can be formalized as the conditional probability: $M_D(\mathit{prompt} \mid \mathit{userInput}, \mathit{APInstruction})$.

Fuzz4All first generates a candidate prompt using greedy sampling with temperature 0 (line 2). By first sampling with low temperature, the algorithm obtains a plausible solution with a high degree of confidence. This approach is commonly used in other domains, e.g., program synthesis [13], where the greedy output is evaluated first to check if it can solve the problem. The algorithm then moves on to sampling with higher temperature to obtain more diverse prompts (line 5), as done in prior work [13, 77]. Compared to greedy sampling, sampling with high temperature yields different prompts that can each provide a unique distilled summary of the user input. Each generated prompt is added to a list of candidate prompts (line 6), until the algorithm reaches the desired number of candidates.

To pick the best input prompt to be used in the fuzzing step, the algorithm evaluates each candidate prompt by performing a small-scale fuzzing experiment. Specifically, the approach uses each prompt as an input to the generation LLM to produce multiple code snippets per prompt. Fuzz4All then scores the generated code snippets for each prompt based on a scoring function. While the scoring function can be based on a variety of different metrics, e.g., coverage, bug finding, or the complexity of generated fuzzing inputs, to make the approach lightweight and general, our scoring function is the number of unique generated code snippets that are valid, i.e., accepted by the target SUT. This metric is chosen since for fuzzing, we want fuzzing inputs to be valid or close to valid to trigger logic deep inside the SUT. Let $M_G$ be the generation LLM, p be a candidate prompt, and isValid be the function that returns 1 if a generated code snippet c is valid and 0 if invalid. Our default scoring function is defined as: $\sum_{c \in M_G(p)} [\mathit{isValid}(c, \mathit{SUT})]$. Finally, Fuzz4All selects the input prompt with the highest score (line 7) as the initial input prompt to be used for fuzzing. In summary, our autoprompting step combines both prompt generation and scoring, which allows Fuzz4All to automatically generate/select a prompt suitable for the fuzzing target.

3.1.2 Example: Autoprompting. Figure 2 shows an example of an input prompt generated by our autoprompting algorithm. The example is for fuzzing C++ compilers while focusing specifically on std::expected, a new feature introduced in C++23. As the user input, we pass the original cppreference documentation [2] to Fuzz4All, which spans multiple screen lengths with small tables and verbose descriptions (498 words, 3262 characters). In contrast, the distilled input prompt created by the autoprompting algorithm provides a more concise natural language description of the targeted feature (214 words, 1410 characters). The input prompt contains a high-level description of how std::expected is to be used. For example, the input prompt contains a concise sentence (highlighted in orange) that summarizes the situations the feature is useful in. Additionally, the input prompt contains descriptions of the inputs, as well as the different usages (i.e., member functions) of the feature. For example, functions and_then, transform, or_else, and transform_error have very similar descriptions in the original documentation, which is repeated for each function. Instead, in the distilled input prompt, these functions are grouped together in a concise manner that still illustrates how they can be used. Using the distilled input prompt, Fuzz4All can generate fuzzing inputs that effectively target the std::expected feature of C++ compilers.

[High-level description of feature] The C++23 std::expected class template provides a way to store either an expected value of type T or an unexpected value of type E. It is useful for handling functions that may return an error or a valid result. The stored value is allocated directly within the storage occupied by the expected object, without dynamic memory allocation.
[Descriptions of the inputs] The template parameters are T (the expected value type) and E (the unexpected value type). Both types must meet the Destructible requirements, and certain types are not allowed.
[Different usages of target] std::expected provides member functions for construction, destruction, assignment, and accessing the stored values. Observers like operator bool and has_value can be used to check if the object contains an expected value. Functions like value, error, and value_or can be used to access the expected or unexpected values.
Monadic operations like and_then, transform, or_else, and transform_error allow chaining operations on expected values and handling errors in a functional manner.
Modifiers like emplace and swap can be used to construct the expected value in-place or exchange the contents of expected objects. Non-member functions like operator== and swap(std::expected) provide comparison and swapping functionality.
Helper classes like unexpected, bad_expected_access, and unexpect_t are used to represent unexpected values, exceptions, and in-place construction tags for unexpected values in expected objects.

Figure 2: Autoprompting result for std::expected.

3.1.3 Comparison with Existing Autoprompting Techniques. To the best of our knowledge, we are the first to automatically distill knowledge from arbitrary user inputs for a software engineering task using black-box autoprompting. Compared to prior work on autoprompting in NLP [66] and software engineering [73], which optimize the prompt by accessing model gradients, our autoprompting needs only black-box, sampling access to the distillation LLM. While the use of a scoring function to evaluate each prompt is similar to …

… sampling multiple times using the same input would produce the same or similar code snippets. For fuzzing, we aim to avoid such repeated inputs and instead want to generate a diverse set of fuzzing inputs that cover new code and discover new bugs. To accomplish this goal, we exploit the ability of LLMs to utilize both examples and natural language instructions to guide the generation.

The high-level idea of the fuzzing loop is to continuously augment the original input prompt by selecting an example fuzzing input from previous iterations and by specifying a generation strategy. The goal of using an example is to demonstrate the kind of code snippet we want the generation LLM to produce. The generation strategies are designed as instructions on what to do with the provided code example. These strategies are inspired by traditional fuzzers, mimicking their ability to synthesize new fuzzing inputs (as in generation-based fuzzers) and to produce variants of previously generated inputs (as in mutation-based fuzzers). Before each new iteration of the fuzzing loop, Fuzz4All appends both an example and a generation strategy to the input prompt, enabling the generation LLM to continuously create new fuzzing inputs.

Algorithm 2: Fuzzing loop
1 Function FuzzingLoop:
    Input : inputPrompt, timeBudget
    Output: bugs
2   genStrats ← [generate-new, mutate-existing, semantic-equiv]
3   fuzzingInputs ← M_G(inputPrompt + generate-new)
4   bugs ← Oracle(fuzzingInputs, SUT)
5   while timeElapsed < timeBudget do
6     example ← sample(fuzzingInputs, SUT)
7     instruction ← sample(genStrats)
8     fuzzingInputs ← M_G(inputPrompt + example + instruction)
9     bugs ← bugs + Oracle(fuzzingInputs, SUT)
10  return bugs

3.2.1 Fuzzing Loop Algorithm. Algorithm 2 describes the fuzzing loop. The inputs are the initial input prompt and the fuzzing budget. The final output is a set of bugs identified by the user-defined oracle. First, the algorithm initializes the generation strategies (generate-new, mutate-existing, and semantic-equiv), which will be used to modify the input prompt during the fuzzing loop (line 2). Figure 3 (top-right) lists our three generation strategies along with their corresponding instructions. For the first invocation of the generation LLM, denoted with $M_G$, the algorithm does not yet have any examples of fuzzing inputs. Hence, it appends to the input prompt the generate-new generation instruction, which guides the model toward producing a first batch of fuzzing inputs (line 3).

Next, the algorithm enters the main fuzzing loop (lines 5–9), which continuously updates the prompt to create new fuzzing inputs.
To this end, the algorithm selects an example from the previous recent work in NLP [85], our scoring function directly evaluates batch of generated fuzzing inputs, randomly picking from all those the prompt on the exact downstream task of generating valid code fuzzing inputs that are valid for the SUT (line 6). In addition to the snippets, instead of using an approximate proxy scoring function. example, the algorithm also randomly picks one of the three gen- eration strategies (line 7). The generation strategy either instructs 3.2 Fuzzing Loop the model to mutate the selected example (mutate-existing), to Given the input prompt created in the first step of Fuzz4All, the produce a fuzzing input that is semantically equivalent to the ex- goal the fuzzing loop is to generate diverse fuzzing inputs using a ample (semantic-equiv), or to come up with a new fuzzing input generation LLM. However, due to the probabilistic nature of LLMs, (generate-new). The algorithm concatenates the initial input prompt, Conference’17, July 2017, Washington, DC, USA Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang strategy name generation instruction Table 1: SUTs and baseline tools. Please create a program which uses complex generate-new {SMT2 logic} for an {SMT solver} Language SUT(s) Baseline tool(s) Version Please create a mutated program that modifies the mutate-existing distillation LLM previous generation C GCC, Clang GrayC [22], Csmith [79] GCC-13.1.1 Please create a semantically equivalent program semantic-equiv to the previous generation C++ G++, Clang++ YARPGen [49] G++-13.1.1 SMT2 supports several theories, SMT2 Z3, CVC5 TypeFuzz [57] CVC5-1.0.5 including integer (declare-const x1 Real) and real arithmetic Go Go go-fuzz [26] go-1.20.6 1 (assert (! 
(= x1 1))) initial prompt (check-sat) Java javac Hephaestus [11] OpenJDK-javac-18 generate-new generation LLM Python Qiskit MorphQ [56] qiskit-0.43.1 SMT2 supports several theories, (declare-const x1 Int) including integer 4 EXPERIMENTAL DESIGN and real arithmetic (assert (! (= x1 1))) initial prompt (check-sat) We evaluate Fuzz4All on the following research questions: (get-model) example generation LLM • RQ1: How does Fuzz4All compare against existing fuzzers? mutate-existing • RQ2: How effective is Fuzz4All in performing targeted fuzzing? SMT2 supports several theories, (declare-const x1 Int) including integer • RQ3: How do different components contribute to Fuzz4All’s and real arithmetic (assert (! (= x1 1) initial prompt :named a)) effectiveness? (check-sat) example (get-model) generation LLM • RQ4: What real-world bugs does Fuzz4All find? semantic-equiv Figure 3: Fuzzing strategies and example of fuzzing loop. 4.1 Implementation the selected example, and the selected generation strategy into a Fuzz4All is primarily implemented in Python. The autoprompting new prompt, and then queries the generation LLM with this prompt and fuzzing loop components of Fuzz4All contain only 872 LoC. to produce another batch of fuzzing inputs (line 8). Compared to traditional fuzzers, such as Csmith (>80K LoC), which The main fuzzing loop is repeated until the algorithm has ex- need high manual effort to implement generators, Fuzz4All has a hausted the fuzzing budget. For each created fuzzing input, Fuzz4All very lightweight implementation. Fuzz4All uses GPT4 [54] as the passes the input to the SUT. If the user-defined oracle identifies an distillation LLM to perform autoprompting since this model is the unexpected behavior, e.g., a crash, then the algorithm adds a report state-of-the-art for a wide range of NLP-based reasoning tasks [9]. to the set of detected bugs (lines 4 and 9). 
Specifically, we use the gpt-4-0613 checkpoint with max_token of 500 provided via the OpenAI API [28]. For autoprompting, we sam- 3.2.2 Example: Fuzzing Loop. Figure 3 illustrates how our fuzzing ple four candidate prompts, generate 30 fuzzing inputs each, and loop uses input examples and the generation strategies to create evaluate using a scoring function based on validity rate (as de- different fuzzing inputs. In this case, we are fuzzing an SMT solver scribed in Section 3.1.1). For the fuzzing loop, we use the Hugging where the inputs are logic formulas written in the SMT2 language. Face implementation of the StarCoder [41] model as the generation Initially 1 , there are no examples, and hence, the algorithm uses LLM, which is trained on over one trillion code tokens across over the generate-new strategy to synthesize new fuzzing inputs. Next, 80 languages. Our default setting when generating fuzzing inputs taking a generated, valid fuzzing input as an example, the algo- uses a temperature of 1, a batch size of 30, a maximum output length rithm queries the model to create a new input 2 based on the of 1,024 using nucleus sampling [33] with a top-p of 1. mutate-existing strategy, which aims to mutate the selected exam- ple. We observe that the new fuzzing input subtly modifies the previ- 4.2 Systems Under Test and Baselines ous input by swapping the type of a variable as well as adding some To demonstrate the generality of Fuzz4All, we evaluate it on six in- computation. In the next fuzzing iteration 3 , the algorithm selects put languages and nine SUTs. Table 1 shows each of the languages, the previously generated fuzzing input as the example and uses the SUTs, and the corresponding baseline tools. Note that we compare semantic-equiv generation strategy, which aims to create an input coverage on one SUT per language, with the SUT versions used that does not modify the semantics of the given example. 
This time, for coverage measurements shown in the last column of Table 1. we observe that the new fuzzing input simply adds a syntax tag to Except for the coverage experiments, we perform fuzzing on the the selected example. In fact, the combination of generation strate- nightly release of each target. Unless otherwise mentioned, we use gies shown in the example helps Fuzz4All to generate a fuzzing unexpected compiler crashes as the oracle and consider a fuzzing input that causes an unexpected crash in the SMT solver. The crash input as valid if it compiles successfully. Each baseline fuzzer is exposes one of the real-world bugs detected by Fuzz4All during run with its default settings. For baseline fuzzers that require input our evaluation, which has been confirmed and fixed by developers. seeds, we use the default seed corpus provided in their replication repository. We now present more evaluation details for each SUT. 3.2.3 Oracle. The fuzzing inputs produced by Fuzz4All during the fuzzing loop can be used to check the behavior of the SUT against 4.2.1 C/C++ Compilers. We target the popular GCC and Clang an oracle to detect bugs. The oracle is custom for each SUT, and it compilers and provide the standard C library documentation as user can be fully defined and customized by the user. For example, when input to Fuzz4All by default. Our baselines include Csmith [79], fuzzing C compilers, a user could define a differential testing oracle a classic generation-based C compiler fuzzer, and GrayC [22], a that compares the compiler behavior under different optimization recent mutation-based fuzzer that uses coverage feedback together levels [79]. In this paper, we focus on simple and easy-to-define with specialized mutation operators. For C++, we target new C++23 oracles, such as crashes due to segmentation faults and internal features by providing the C++23 standard documentation as input assertion failures, with more details discussed in Section 4.2. to Fuzz4All. 
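The crash oracle and validity criterion stated in Section 4.2 (a compiler crash counts as a bug, and an input is valid if it compiles successfully) could be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not Fuzz4All's actual code: the names classify and check_input, the crash-marker strings, the extra "timeout" outcome, and the 60-second limit are hypothetical choices made only for this sketch.

```python
import os
import subprocess
import tempfile
from typing import Sequence

# Heuristic crash markers; these strings are assumptions for illustration,
# not the list used by Fuzz4All.
CRASH_MARKERS = (
    "internal compiler error",     # diagnostic GCC typically prints on an ICE
    "PLEASE submit a bug report",  # banner Clang typically prints on a crash
)


def classify(returncode: int, stderr: str) -> str:
    """Map one compiler run to 'crash', 'valid', or 'invalid'."""
    # A negative return code means the process was killed by a signal.
    if returncode < 0 or any(marker in stderr for marker in CRASH_MARKERS):
        return "crash"
    return "valid" if returncode == 0 else "invalid"


def check_input(source: str, compiler_cmd: Sequence[str], suffix: str = ".c") -> str:
    """Write a fuzzing input to a temporary file, compile it, and classify the run."""
    with tempfile.NamedTemporaryFile("w", suffix=suffix, delete=False) as handle:
        handle.write(source)
        path = handle.name
    try:
        proc = subprocess.run(
            [*compiler_cmd, path],
            capture_output=True,
            text=True,
            errors="replace",
            timeout=60,  # hypothetical limit for this sketch
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    finally:
        os.unlink(path)
    return classify(proc.returncode, proc.stderr)
```

Assuming gcc is installed, a call such as check_input(program_text, ["gcc", "-c", "-o", os.devnull]) would compile one generated program without keeping the object file. Keeping classify free of I/O means the decision rule can be checked without a compiler present.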
Our C++ baseline is YARPGen [49], a generation-based fuzzer that extends Csmith with new language features in C++ and generation policies to trigger different compiler optimizations.

4.2.2 SMT Solvers. We run Fuzz4All on Z3 and CVC5 with commonly enabled developer settings, such as debug and assertion, following prior work [57, 75, 76]. Fuzz4All generates SMT formulas as fuzzing inputs using an overview documentation of the SMT2 language and SMT solver as input by default. A fuzzing input is considered valid if the SMT solver returns either SAT or UNSAT without any error. Our baseline is state-of-the-art TypeFuzz [57], which mutates existing SMT expressions based on newly generated expressions of the same type.

4.2.3 Go Toolchain. We run Fuzz4All on the most recent version of Go. By default, we use the Go standard library documentation as input to Fuzz4All. As a baseline, we use go-fuzz [26], a coverage-guided, mutation-based fuzzer designed for Go, which generates inputs for various Go standard libraries using handwritten templates.

4.2.4 Java Compiler. We evaluate Fuzz4All on the OpenJDK Java compiler, javac, which compiles source code into bytecode. Our default input is the latest standard Java API documentation page. We compare against Hephaestus [11], a recent combined generation- and mutation-based fuzzer designed for JVM compilers and targeting type-related bugs.

4.2.5 Quantum Computing Platform. We target Qiskit [1], a popular quantum computing framework [24]. Qiskit is built on top of Python, i.e., both the input program and the compilation are defined in Python code. Thus, creating a valid input for Qiskit means using the Qiskit Python APIs in a meaningful way, e.g., to create a quantum circuit. It is challenging for traditional synthesis tools to handle dynamically typed general-purpose languages (like Python) [30, 65], not to mention the additional API constraints, making fuzzing Qiskit a particularly difficult challenge. Our baseline is MorphQ [56], a recent fuzzer that uses a template- and grammar-based approach to generate valid quantum programs and then applies metamorphic transformations.

Unlike for the other SUTs, which receive fuzzing inputs in a file, to invoke Qiskit, we must run the generated Python program itself. As an oracle, we add statements at the end of the generated Python file, which collect all QuantumCircuit objects via Python’s built-in introspection APIs and then apply two oracles on each circuit. The two oracles are directly borrowed from previous work for a fair comparison [56]. The first oracle compiles the circuit via a transpile call with different optimization levels and reports any crash. The second oracle converts the circuit to its lower-level QASM [16] representation and then reads it back, reporting any crash.

4.3 Experimental Setup and Metrics

Fuzzing campaigns. For RQ1, we use a fuzzing budget of 24 hours (including autoprompting), which is used commonly in prior work [37]. To account for variance, we repeat the experiment for both Fuzz4All and the baselines five times. Due to the high cost of experiments, for later RQs, we use a fuzzing budget of 10,000 generated fuzzing inputs and repeat four times for the ablation study.

Environment. Experiments are conducted on a 64-core workstation with 256 GB RAM running Ubuntu 20.04.5 LTS with 4 NVIDIA RTX A6000 GPUs (only one GPU is used per fuzzing run).

Metrics. We use the widely adopted measure of code coverage for evaluating fuzzing tools [7, 37, 74]. To be uniform, we report the line coverage for each of the targets studied in the evaluation. Following prior work [37], we use the Mann-Whitney U-test [52] to compute statistical significance and indicate significant (p < 0.05) coverage results in applicable tables (Tables 2 and 4) with *. We additionally measure the validity rate (% valid) of inputs as the percentage of fuzzing inputs generated that are valid and unique. As Fuzz4All supports both general and targeted fuzzing, to assess the effectiveness of targeted fuzzing, we report the hit rate, i.e., the percentage of fuzzing inputs that use a specific target feature (checked with simple regular expressions). Finally, we also report the most important metric and goal of fuzzing: the number of bugs detected by Fuzz4All for each of our nine SUTs.

5 RESULTS

5.1 RQ1: Comparison against Existing Fuzzers

5.1.1 Coverage over Time. Figure 4 shows the 24-hour coverage trend of Fuzz4All compared with the baselines, where the solid line shows average coverage and the area indicates the minimum and maximum across five runs. We observe that Fuzz4All achieves the highest coverage by the end of the fuzzing campaign across all targets, with an average improvement of 36.8% compared to the top performing baselines. Contrasting with generation-based fuzzers (i.e., YARPGen and MorphQ), Fuzz4All is able to almost immediately achieve higher coverage, demonstrating the powerful generative ability of LLMs in producing diverse code snippets compared to traditional program generation techniques. While mutation-based fuzzers (i.e., go-fuzz and GrayC) are able to achieve higher coverage in the beginning through the use of high quality seeds, the coverage gained via mutations rapidly falls off and Fuzz4All is able to slowly but surely cover more code. Note that we include the autoprompting time as part of the fuzzing budget for a fair comparison, which incurs negligible overhead (avg. 2.3 minutes per fuzzing campaign).

[Figure 4 plots omitted. Panels: (a) GCC, (b) G++, (c) CVC5, (d) Go, (e) javac, (f) Qiskit. x-axis: Hours (0 to 24); y-axis: Coverage (#K lines). Each panel plots Fuzz4All against the baseline(s) for that target; seed coverage is marked in the GCC, CVC5, and Go panels.]

Figure 4: Coverage trend of Fuzz4All against state-of-the-art fuzzers in a 24-hour fuzzing campaign.

Unlike the baseline fuzzers, which reach a coverage plateau by the end of the 24-hour period, Fuzz4All keeps finding inputs that cover new code, even near the end of the fuzzing campaign. Recall that during each iteration of Fuzz4All’s fuzzing loop, the original input prompt is updated with both a new example and a generation strategy (Section 3.2), nudging the LLM to generate new fuzzing inputs. We hypothesize that this allows Fuzz4All to effectively generate new and diverse fuzzing inputs even after a long period of fuzzing, leading to sustained coverage increase.

Table 2: Fuzz4All against state-of-the-art fuzzers (* indicates statistically significant coverage improvement).

  Target   Fuzzer       # programs   % valid    Coverage   Improvement
  GCC      GrayC           104,326    95.96%     167,453
           Csmith           61,883    99.99%     111,668
           Fuzz4All         44,324    37.26%    *198,927   +18.8%
  G++      YARPGen         255,581    99.99%     166,614
           Fuzz4All         26,365    40.74%    *210,743   +26.5%
  CVC5     TypeFuzz         43,001    93.24%      46,174
           Fuzz4All         36,054    47.63%     *57,674   +24.9%
  Go       go-fuzz          20,002   100.00%      38,024
           Fuzz4All         22,817    23.02%     *43,317   +13.7%
  javac    Hephaestus      728,217    57.22%      10,285
           Fuzz4All         31,967    49.05%     *16,552   +60.9%
  Qiskit   MorphQ           38,474   100.00%      19,929
           Fuzz4All         33,454    24.90%     *34,988   +75.6%

5.1.2 Generation Validity, Number, and Coverage. We examine the number of fuzzing inputs generated and their validity rate across our studied SUTs. In Table 2, Column “# programs” represents the number of unique inputs generated, “% valid” is the percentage of fuzzing inputs that are valid, and “Coverage” shows the final coverage obtained by each fuzzer along with the relative improvement over the best baseline. We first observe that almost all traditional fuzzing tools can achieve a very high validity rate apart from Hephaestus, which purposefully generates invalid code (focused on incorrect types) to check for miscompilation bugs. In contrast, Fuzz4All has a lower percentage of valid fuzzing inputs generated (56.0% average reduction compared to baseline tools). Furthermore, the raw number of fuzzing inputs generated by baseline tools is also much higher. By using an LLM as the generation engine, Fuzz4All is bottlenecked by GPU inference, leading to 43.0% fewer fuzzing inputs compared to traditional fuzzers.

In spite of the lower validity rate and number of fuzzing inputs, Fuzz4All generates much more diverse programs compared to traditional fuzzing tools, as evidenced by the high coverage obtained (+36.8% average increase). Additionally, even invalid code snippets that are close to valid can be useful for fuzzing, as they allow for finding bugs in the validation logic of the SUT. In Section 5.4, we further describe the various types of bugs detected by Fuzz4All, with both valid and invalid code snippets, to additionally showcase the benefit of generating diverse fuzzing inputs.

We note that Fuzz4All achieves a wide range of validity rates and numbers of fuzzing inputs across different SUTs. The number of fuzzing inputs varies across targets due to the varying cost to invoke the SUT after each fuzzing iteration for bug detection. Regarding validity rate, a general-purpose programming language, such as C, has a relatively lower validity rate compared to domain-specific languages, such as the SMT2 language used for SMT solvers. A more rigorous language, e.g., Go, which does not allow any declared but unused variables, has an even lower validity rate. We also observe a low validity rate for fuzzing quantum computing platforms. As quantum computing is an emerging area with its own set of library APIs, the generation LLM may not have seen as many examples of quantum programs during its training as for more established languages. Nevertheless, Fuzz4All is still able to leverage user-provided documentation to generate interesting fuzzing inputs, which leverage quantum library APIs and achieve an impressive coverage improvement (+75.6%) compared to the state-of-the-art fuzzer.

5.2 RQ2: Effectiveness of Targeted Fuzzing

We now evaluate the ability of Fuzz4All to perform targeted fuzzing, i.e., to generate fuzzing inputs that focus on a particular feature. For each target SUT and language, we test by targeting three different example features and compare them to the setup with general user input, as used for RQ1 (described in Section 4.3). These features are built-in libraries or functions/APIs (Go, C++ and Qiskit), language keywords (C and Java), and theories (SMT). The user input for the targeted fuzzing runs is documentation of the particular feature we are focusing on. Table 3 shows the results of targeted fuzzing as well as the default general fuzzing used in RQ1. Each column represents a targeted fuzzing run where we focus on one feature. The value in each cell shows the hit rate of the feature (Section 4.3) for a particular fuzzing run. We also include the coverage results obtained.

Table 3: Hit rate and coverage during targeted fuzzing. Each column is one fuzzing campaign; each row above "Coverage" gives the hit rate of one feature.

  C targeted campaign (keywords)
                  typedef       union         goto          General
  typedef         83.11%        47.16%        0.48%         4.38%
  union           10.80%        80.43%        0.10%         0.32%
  goto            0.22%         0.11%         77.62%        1.16%
  Coverage        123,226       125,041       120,452       188,148

  C++ targeted campaign (built-in functions)
                  apply         expected      variant       General
  apply           70.23%        0.41%         0.68%         0.32%
  expected        0.26%         79.72%        0.94%         1.33%
  variant         1.16%         5.98%         93.19%        3.63%
  Coverage        182,261       175,963       182,333       193,254

  SMT targeted campaign (theories)
                  Array         BitVec        Real          General
  Array           82.23%        2.08%         1.44%         11.07%
  BitVec          2.57%         88.48%        0.86%         5.46%
  Real            1.45%         0.17%         96.01%        17.36%
  Coverage        46,392        48,841        47,619        52,449

  Go targeted campaign (built-in libraries)
                  atomic        big           heap          General
  atomic          90.09%        0.04%         0.06%         1.01%
  big             0.18%         97.20%        0.23%         3.63%
  heap            0.30%         0.04%         91.18%        2.22%
  Coverage        10,156        12,986        9,790         37,561

  Java targeted campaign (keywords)
                  instanceof    synchronized  finally       General
  instanceof      88.00%        0.08%         0.85%         1.86%
  synchronized    0.16%         94.80%        0.16%         0.85%
  finally         0.51%         3.17%         78.62%        0.82%
  Coverage        14,546        13,972        13,203        16,128

  Qiskit targeted campaign (APIs)
                  switch        for loop      linear        General
  switch          71.76%        0.00%         0.00%         0.00%
  for loop        0.17%         75.97%        0.00%         0.00%
  linear          0.00%         0.00%         54.79%        0.00%
  Coverage        30,597        26,703        29,535        33,853

We observe that targeting a specific feature yields a high amount of fuzzing inputs that directly use the feature, with an average hit rate of 83.0%. This result demonstrates that Fuzz4All indeed performs targeted fuzzing by prompting the generation LLM with an input prompt that describes a particular feature. Furthermore, we observe that fuzzing on features that are related can lead to a moderately high cross-feature hit rate (i.e., hit rate of feature X on fuzzing run for feature Y). For example, the C keywords typedef and union are both related to type operations, and hence, their cross-feature hit rate is high compared to an unrelated feature, such as goto. As shown in Table 3, a general fuzzing approach, while achieving the highest overall code coverage, can be extremely inefficient in targeting a specific feature (average 96.0% reduction in hit rate compared with Fuzz4All’s targeted fuzzing). For example, in Qiskit, the general fuzzing campaign has a 0% hit rate of the three target features. This can be explained by the fact that these features were added recently to Qiskit and not yet widely used, thus being extremely rare in the LLM training data. However, by providing suitable user input during the targeted fuzzing campaign, Fuzz4All can successfully generate fuzzing inputs that use these new features. This ability of Fuzz4All will be valuable to developers who want to test novel features or components of a SUT.

5.3 RQ3: Ablation Study

To study how each component of Fuzz4All contributes to the overall fuzzing effectiveness, we conduct an ablation study based on the two key components of Fuzz4All: (a) Autoprompting, the type of initial input prompt provided to the generation LLM; (b) Fuzzing loop, the use of selected examples and generation strategies. We study three variants for each of the two key components. Table 4 shows the coverage and validity rate of our studied variants.

5.3.1 Autoprompting. First, we examine the effect of different initial inputs provided to the generation LLM. To reduce the impact of additional factors, we fix the generation strategy to only use generate-new and study three variants: 1) no input: does not use any initial prompt, 2) raw prompt: directly uses the raw user input as the initial prompt, 3) autoprompt: applies autoprompting to generate the initial prompt. We observe that across all studied languages, the no input variant achieves the lowest coverage. In no input, we do not provide any initial prompt, which would otherwise provide useful information on the features we want to generate fuzzing inputs for. As such, the LLM can only generate simple code snippets with high validity rate but is less effective in covering the SUT. We observe a coverage boost as we use the raw prompt variant, where we provide the raw documentation as the initial prompt. However, we can further improve both the code coverage and the validity rate by using our autoprompting stage to distill the user input into a concise but informative prompt (autoprompt), instead of using the raw user input. Directly using the user-provided input may include information that is irrelevant for fuzzing, leading to both a lower validity rate (as the generation LLM may struggle to understand the raw documentation) and lower coverage (since, unlike our autoprompting generated prompt, the raw documentation is not designed to be used for LLM generation).

5.3.2 Fuzzing loop. Next, we examine the different variants of our fuzzing loop setup by keeping the initial prompt the same (by using the default autoprompting): 1) w/o example: does not select an example during the fuzzing loop (i.e., it continuously samples from the same initial prompt), 2) w/ example: selects an example but only uses the generate-new instruction, 3) Fuzz4All: the full approach with all generation strategies used. We first observe that by only sampling from the same input (w/o example), LLMs will often repeatedly generate the same or similar fuzzing inputs. On average, 8.0% of the fuzzing inputs generated are repeated in w/o example compared to only 4.7% when using the full Fuzz4All approach. Adding an example to the input prompt (w/ example) avoids sampling from the same distribution and improves both coverage and validity rate. Finally, the full Fuzz4All approach achieves the highest coverage across all SUTs. Compared to the w/ example variant (the second-best), the full Fuzz4All adds additional generation strategies, semantic-equiv and mutate-existing, which help to further provide useful instructions to the generation LLM.

Table 4: Effectiveness of variants (* indicates statistically significant coverage improvement compared w/ 2nd best variant).
  Each cell reports Cov. / % valid.

  Variants      Description                 C                  C++                 SMT                Go                 Java               Qiskit
  Autoprompt.
  no input      no initial prompt           127,261 / 42.57%   181,493 / 51.63%    50,838 / 49.49%    35,765 / 39.54%    14,374 / 50.25%    31,701 / 34.63%
  raw prompt    use user-provided input     137,204 / 33.95%   189,030 / 33.79%    49,697 / 39.49%    36,168 / 16.84%    15,445 / 37.64%    31,922 / 22.74%
  autoprompt    apply autoprompting         182,530 / 39.09%   190,318 / 36.62%    51,496 / 45.04%    36,732 / 24.87%    15,838 / 45.54%    32,691 / 29.12%
  Fuzzing loop
  w/o example   generate-new w/o example    143,349 / 34.23%   190,288 / 28.25%    50,089 / 18.41%    35,839 / 19.38%    15,444 / 44.69%    32,663 / 24.04%
  w/ example    generate-new w/ example     182,530 / 39.09%   190,318 / 36.62%    51,496 / 45.04%    36,732 / 24.87%    15,838 / 45.54%    32,691 / 29.12%
  Fuzz4All      all strategies w/ example   185,491 / 40.58%   *193,845 / 41.22%   *53,069 / 50.06%   *37,981 / 32.00%   *16,209 / 50.99%   *33,913 / 27.45%

  Footnote 1: The impact of additional generation strategies can be found in Section 5.3.2.
  Footnote 2: Note that autoprompt and w/ example are the same variant, but we include them separately for ease of comparison.

Table 5: Summary of Fuzz4All-detected bugs.

                     Confirmed
           Total   Unknown   Known   Pending   Won’t fix
  GCC         22        10       6         6           0
  Clang       20        13       7         0           0
  CVC5         6         4       2         0           0
  Z3          12        10       0         0           2
  Go           4         2       2         0           0
  Java         1         0       0         1           0
  Qiskit      11         8       2         1           0
  Total       76        47      19         8           2

  (a) GCC bug: Internal compiler error (segmentation fault)

      #include <optional>
      void y(std::optional<int> z) noexcept(noexcept(std::optional<int>{z})) {}

  (b) Clang bug: Segmentation fault

      #include <iostream>
      using E = std::numeric_limits<int>;
      auto fail(E e) -> decltype(throw e, void()) { throw e; }

  (c) Go bug: Segmentation violation

      package main
      import ("runtime")
      func main() { runtime.ReadMemStats(nil) }

  (d) Qiskit bug: Crash

      from qiskit import QuantumCircuit, ClassicalRegister
      crz = ClassicalRegister(1, name="crz")
      qc = QuantumCircuit(crz)
      qc.qasm(filename="my.qasm")
      QuantumCircuit.from_qasm_file("my.qasm")

Figure 5: Exemplary bugs found by Fuzz4All.

5.4 RQ4: Bug Finding

Table 5 summarizes the bugs found by Fuzz4All on our nine studied SUTs. In total, Fuzz4All detects 76 bugs, with 47 bugs already confirmed by developers as previously unknown. These results not only demonstrate the practical effectiveness of Fuzz4All in finding large amounts of bugs but also the promised generality of Fuzz4All across languages and SUTs.

5.4.1 Examples. Figure 5a shows a bug found in GCC when using noexcept(x), a C++ feature that specifies a function is non-throwing if x evaluates to true. In this example bug, Fuzz4All generates a rather complex code using std::optional, which indicates that a particular value may or may not be present at runtime. While this code is valid and should compile correctly, this combination of difficult runtime dependencies causes GCC to crash with an internal compiler error. We note that this bug cannot be found by prior techniques since they simply do not support the noexcept feature. The developers have already confirmed and fixed this bug. Interestingly, they even added a slightly modified version of our submitted code snippet to the official test suite of GCC.

Figure 5b shows a bug found in Clang, where the invalid code leads to a segmentation fault. Fuzz4All uses an unusual syntax for function declaration (i.e., auto x (...) -> return_type), which makes use of the decltype operation in C++. However, the bug occurs when the throw statement inside of the decltype is evaluated first, skipping the evaluation of the return type since throw exits the scope early and crashes Clang. This code, while invalid, is still useful to reveal a bug in the Clang frontend as confirmed by developers. Additionally, prior fuzzing tools can hardly find this bug since they typically focus on generating valid code only and do not handle the especially difficult-to-model decltype function.

Figure 5c shows a bug found in Go where a nil input causes a segmentation fault instead of producing a useful failure message. This bug is found by targeting the runtime Go standard library, where we provide the documentation, which includes the description of the ReadMemStats function. The bug has been confirmed and fixed by the developers. While this bug might look simple (invoking a singular function), it cannot be found by the go-fuzz baseline simply because go-fuzz requires manually written templates to target specific libraries, and runtime is not a part of any such template. With Fuzz4All, users can directly target any Go standard libraries by providing relevant input information (e.g., documentation).

Figure 5d shows a bug found in Qiskit’s QASM exporter. A quantum program, represented by the qc variable, is exported to QASM, a low level representation, silently generating an invalid output file, which leads to a crash when being reimported. The problem is that the exporter represents the register in QASM using its name as identifier, i.e., "crz", which also is the name of a well-known operation of the QASM language, thus making the generated code ambiguous. Note that prior work [56] could not find this bug because they use pre-defined templates with only anonymous registers, whereas Fuzz4All effectively leverages the quantum knowledge of LLMs to inject a meaningful string literal for detecting this bug.

6 THREATS TO VALIDITY

Internal. The main internal threat comes from the implementation of Fuzz4All. To address this, we performed code reviews and testing to ensure correctness. Furthermore, we run each baseline from their provided replication package whenever possible.

External. The main external threat is our evaluation targets. To support our generality claim, we apply Fuzz4All on nine different SUTs across six languages. Additionally, to account for variance in long fuzzing runs, we repeat the 24-hour fuzzing campaign five times and check for statistically significant results. Since the generation LLM leverages the knowledge acquired during its training done within the last year, reapplying Fuzz4All using the exact checkpoint of the LLM (StarCoder) used in this work might degrade the effectiveness in the future due to data-shift. Fuzz4All can mitigate this using the autoprompting step where more up-to-date documentation/example code allows the model to also generate up-to-date fuzzing inputs. One additional threat comes from the use of the distillation LLM to generate the initial inputs, where the LLM may “hallucinate”, i.e., produce made-up or inaccurate information [31].

[13] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
[14] Yongheng Chen, Rui Zhong, Hong Hu, Hangfan Zhang, Yupeng Yang, Dinghao Wu, and Wenke Lee. 2021. One engine to fuzz’em all: Generic language processor testing with semantic validation. In 2021 IEEE Symposium on Security and Privacy (SP). IEEE, 642–658.
[15] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay
This limitation is common to most pipelines that Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek use LLMs, and we hope to address it in our future work. Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr 7 CONCLUSION Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, We present Fuzz4All, a universal fuzzer leveraging LLMs to sup- Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. PaLM: Scaling Language Modeling port both general and targeted fuzzing of arbitrary SUTs that take with Pathways. arXiv:2204.02311 [cs.CL] in a multitude of programming languages. Fuzz4All uses a novel [16] Andrew W. Cross, Lev S. Bishop, John A. Smolin, and Jay M. Gambetta. 2017. Open Quantum Assembly Language. arXiv:1707.03429 [quant-ph] (July 2017). autoprompting stage to produce input prompts that concisely sum- arXiv:1707.03429 [quant-ph] marize the user-provided inputs. In its fuzzing loop, Fuzz4All [17] Chris Cummins, Pavlos Petoumenos, Alastair Murray, and Hugh Leather. 2018. Compiler fuzzing through deep learning. In Proceedings of the 27th ACM SIGSOFT iteratively updates the initial input prompt with both code exam- International Symposium on Software Testing and Analysis. 95–105. ples and generation strategies aimed at producing diverse fuzzing [18] Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Ling- inputs. Evaluation results on nine different SUTs across six differ- ming Zhang. 2023. Large Language Models are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via Large Language Models. In ISSTA 2023. 423–435. 
ent languages demonstrate that Fuzz4All is able to significantly [19] Yinlin Deng, Chunqiu Steven Xia, Chenyuan Yang, Shizhuo Dylan Zhang, Shujing improve coverage compared to state-of-the-art tools. Furthermore, Yang, and Lingming Zhang. 2023. Large language models are edge-case fuzzers: Fuzz4All is able to detect 76 bugs with 47 already confirmed by Testing deep learning libraries via fuzzgpt. arXiv preprint arXiv:2304.02014 (2023). [20] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: developers as previously unknown. Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). [21] Karine Even-Mendoza, Cristian Cadar, and Alastair F Donaldson. 2022. REFERENCES CsmithEdge: more effective compiler testing by handling undefined behaviour [1] 2021. Qiskit/Qiskit. https://github.com/Qiskit/qiskit. less conservatively. Empirical Software Engineering 27, 6 (2022), 129. [2] 2023. std::expected. https://en.cppreference.com/w/cpp/utility/expected. [22] Karine Even-Mendoza, Arindam Sharma, Alastair F. Donaldson, and Cristian [3] Cornelius Aschermann, Tommaso Frassetto, Thorsten Holz, Patrick Jauernig, Cadar. 2023. GrayC: Greybox Fuzzing of Compilers and Analysers for C (ISSTA Ahmad-Reza Sadeghi, and Daniel Teuchert. 2019. NAUTILUS: Fishing for Deep 2023). Association for Computing Machinery, New York, NY, USA, 1219–1231. Bugs with Grammars.. In NDSS. https://doi.org/10.1145/3597926.3598130 [4] Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan [23] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. 2023. A multitask, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and Pre-Trained Model for Programming and Natural Languages. arXiv:2002.08155. interactivity. 
arXiv preprint arXiv:2302.04023 (2023). [24] Mark Fingerhuth, Tomáš Babej, and Peter Wittek. 2018. Open Source Soft- [5] Patrick Bareiß, Beatriz Souza, Marcelo d’Amorim, and Michael Pradel. ware in Quantum Computing. PLOS ONE 13, 12 (Dec. 2018), e0208561. 2022. Code Generation Tools (Almost) for Free? A Study of Few-Shot, https://doi.org/10.1371/journal.pone.0208561 Pre-Trained Language Models on Code. CoRR abs/2206.01335 (2022). [25] Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, https://doi.org/10.48550/arXiv.2206.01335 arXiv:2206.01335 Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis. 2022. Incoder: A [6] Marcel Böhme, Cristian Cadar, and Abhik Roychoudhury. 2020. Fuzzing: generative model for code infilling and synthesis. arXiv preprint arXiv:2204.05999 Challenges and reflections. IEEE Software 38, 3 (2020), 79–86. (2022). [7] Marcel Böhme, László Szekeres, and Jonathan Metzman. 2022. On the reliability [26] go-fuzz 2023. go-fuzz: randomized testing for Go. https://github.com/dvyukov/go- of coverage-based fuzzer benchmarking. In ICSE 2022. 1621–1633. fuzz. [8] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Pra- [27] Patrice Godefroid, Hila Peleg, and Rishabh Singh. 2017. Learn&fuzz: Machine fulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, learning for input fuzzing. In ASE 2017. IEEE, 50–59. Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon [28] gpt4endpoint 2023. Models - GPT-4. https://platform.openai.com/docs/models/ Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher gpt- 4. Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack [29] Alex Groce, Rijnard van Tonder, Goutamkumar Tulajappa Kalburgi, and Claire Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Le Goues. 2022. Making no-fuss compiler fuzzing effective. In Proceedings of the Dario Amodei. 2020. 
Language Models are Few-Shot Learners. arXiv:2005.14165. 31st ACM SIGPLAN International Conference on Compiler Construction. 194–204. [9] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric [30] Sumit Gulwani, Oleksandr Polozov, Rishabh Singh, et al. 2017. Program synthesis. Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Foundations and Trends® in Programming Languages 4, 1-2 (2017), 1–119. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv [31] Zhijiang Guo, Michael Schlichtkrull, and Andreas Vlachos. 2022. A survey on preprint arXiv:2303.12712 (2023). automated fact-checking. Transactions of the Association for Computational [10] Alexander Bulekov, Bandan Das, Stefan Hajnoczi, and Manuel Egele. 2023. No Linguistics 10 (2022), 178–206. Grammar, No Problem: Towards Fuzzing the Linux Kernel without System-Call [32] Christian Holler, Kim Herzig, and Andreas Zeller. 2012. Fuzzing with code Descriptions. In Network and Distributed System Security (NDSS) Symposium 2023. fragments. In 21st USENIX Security Symposium (USENIX Security 12). 445–458. [11] Stefanos Chaliasos, Thodoris Sotiropoulos, Diomidis Spinellis, Arthur Gervais, [33] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The Curious Benjamin Livshits, and Dimitris Mitropoulos. 2022. Finding typing compiler bugs. Case of Neural Text Degeneration. arXiv:1904.09751. In Proceedings of the 43rd ACM SIGPLAN International Conference on Programming [34] Bo Jiang, Xiaoyan Wang, Wing Kwong Chan, TH Tse, Na Li, Yongfeng Yin, and Language Design and Implementation. 183–198. Zhenyu Zhang. 2020. Cudasmith: A fuzzer for CUDA compilers. In 2020 IEEE [12] Junjie Chen, Jibesh Patra, Michael Pradel, Yingfei Xiong, Hongyu Zhang, Dan 44th Annual Computers, Software, and Applications Conference (COMPSAC). IEEE, Hao, and Lu Zhang. 2020. A survey of compiler testing. ACM Computing Surveys 861–871. (CSUR) 53, 1 (2020), 1–36. 
[35] jsfunfuzz 2017. Introducing jsfunfuzz. https://www.squarefree.com/2007/08/02/introducing-jsfunfuzz/.
[36] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).
[37] George Klees, Andrew Ruef, Benji Cooper, Shiyi Wei, and Michael Hicks. 2018. Evaluating Fuzz Testing. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (CCS ’18). Association for Computing Machinery, New York, NY, USA, 2123–2138. https://doi.org/10.1145/3243734.3243804
[38] Suyoung Lee, HyungSeok Han, Sang Kil Cha, and Sooel Son. 2020. Montage: A Neural Network Language Model-Guided JavaScript Engine Fuzzer. In 29th USENIX Security Symposium (USENIX Security 20). 2613–2630.
[39] Caroline Lemieux, Jeevana Priya Inala, Shuvendu K Lahiri, and Siddhartha Sen. 2023. CODAMOSA: Escaping Coverage Plateaus in Test Generation with Pre-trained Large Language Models. In ICSE 2023.
[40] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019).
[41] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023).
[42] Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190 (2021).
[43] libFuzzer 2023. libFuzzer – a library for coverage-guided fuzz testing. https://llvm.org/docs/LibFuzzer.html.
[44] Christopher Lidbury, Andrei Lascu, Nathan Chong, and Alastair F Donaldson. 2015. Many-core compiler fuzzing. ACM SIGPLAN Notices 50, 6 (2015), 65–76.
[45] Jiawei Liu, Jinkun Lin, Fabian Ruffy, Cheng Tan, Jinyang Li, Aurojit Panda, and Lingming Zhang. 2023. Nnsmith: Generating diverse and valid test cases for deep learning compilers. In ASPLOS 2023, Volume 2. 530–543.
[46] Jiawei Liu, Yuxiang Wei, Sen Yang, Yinlin Deng, and Lingming Zhang. 2022. Coverage-guided tensor compiler fuzzing with joint ir-pass mutation. Proceedings of the ACM on Programming Languages 6, OOPSLA1 (2022), 1–26.
[47] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. CoRR abs/2107.13586 (2021). arXiv:2107.13586 https://arxiv.org/abs/2107.13586
[48] Xiao Liu, Xiaoting Li, Rupesh Prajapati, and Dinghao Wu. 2019. Deepfuzz: Automatic generation of syntax valid c programs for fuzz testing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 1044–1051.
[49] Vsevolod Livinskii, Dmitry Babokin, and John Regehr. 2020. Random testing for C and C++ compilers with YARPGen. Proceedings of the ACM on Programming Languages 4, OOPSLA (2020), 1–25.
[50] M. Zalewski 2016. American Fuzzy Lop - Whitepaper. https://lcamtuf.coredump.cx/afl/technical_details.txt.
[51] Haoyang Ma. 2023. A Survey of Modern Compiler Fuzzing. arXiv preprint arXiv:2306.06884 (2023).
[52] Henry B Mann and Donald R Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics (1947), 50–60.
[53] Pengyu Nie, Rahul Banerjee, Junyi Jessy Li, Raymond J. Mooney, and Milos Gligoric. 2023. Learning Deep Semantics for Test Completion. In 45th International Conference on Software Engineering.
[54] OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
[55] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
[56] Matteo Paltenghi and Michael Pradel. 2023. MorphQ: Metamorphic Testing of the Qiskit Quantum Computing Platform. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE Computer Society, 2413–2424. https://doi.org/10.1109/ICSE48619.2023.00202
[57] Jiwon Park, Dominik Winterer, Chengyu Zhang, and Zhendong Su. 2021. Generative type-aware mutation for testing SMT solvers. Proceedings of the ACM on Programming Languages 5, OOPSLA (2021), 1–19.
[58] Jibesh Patra and Michael Pradel. 2016. Learning to fuzz: Application-independent fuzz testing with probabilistic, generative models of input data. (2016).
[59] PyTorch 2023. PyTorch. http://pytorch.org.
[60] Guanghui Qin and Jason Eisner. 2021. Learning How to Ask: Querying LMs with Mixtures of Soft Prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).
[61] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018).
[62] Timo Schick and Hinrich Schütze. 2020. Exploiting cloze questions for few shot text classification and natural language inference. arXiv preprint arXiv:2001.07676 (2020).
[63] John Schulman, Barret Zoph, Jacob Hilton, Christina Kim, Jacob Menick, Jiayi Weng, Juan Felipe Ceron Uribe, Liam Fedus, Luke Metz, Michael Pokorny, Rapha Gontijo Lopes, Shengjia Zhao, Arun Vijayvergiya, Eric Sigler, Adam Perelman, Chelsea Voss, Mike Heaton, Joel Parish, Dave Cummings, Rajeev Nayak, Valerie Balcom, David Schnurr, Tomer Kaftan, Chris Hallacy, Nicholas Turley, Noah Deutsch, Vik Goel, Jonathan Ward, Aris Konstantinidis, Wojciech Zaremba, Long Ouyang, Leonard Bogdonoff, Joshua Gross, David Medina, Sarah Yoo, Teddy Lee, Ryan Lowe, Dan Mossing, Joost Huizinga, Roger Jiang, Carroll Wainwright, Diogo Almeida, Steph Lin, Marvin Zhang, Kai Xiao, Katarina Slama, Steven Bills, Alex Gray, Jan Leike, Jakub Pachocki, Phil Tillet, Shantanu Jain, Greg Brockman, and Nick Ryder. 2022. ChatGPT: Optimizing Language Models for Dialogue. (2022). https://openai.com/blog/chatgpt/.
[64] Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2023. Adaptive Test Generation Using a Large Language Model. arXiv:2302.06527 [cs.SE]
[65] Kensen Shi, David Bieber, and Rishabh Singh. 2022. Tf-coder: Program synthesis for tensor manipulations. ACM Transactions on Programming Languages and Systems (TOPLAS) 44, 2 (2022), 1–36.
[66] Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. 2020. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980 (2020).
[67] Michael Sutton, Adam Greene, and Pedram Amini. 2007. Fuzzing: Brute Force Vulnerability Discovery. Addison-Wesley Professional.
[68] syzkaller 2023. syzkaller - kernel fuzzer. https://github.com/google/syzkaller.
[69] Derek Tam, Rakesh R Menon, Mohit Bansal, Shashank Srivastava, and Colin Raffel. 2021. Improving and simplifying pattern exploiting training. arXiv preprint arXiv:2103.11955 (2021).
[70] TensorFlow 2023. TensorFlow. https://www.tensorflow.org.
[71] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
[72] Vasudev Vikram, Caroline Lemieux, and Rohan Padhye. 2023. Can Large Language Models Write Good Property-Based Tests? arXiv preprint arXiv:2307.04346 (2023).
[73] Chaozheng Wang, Yuanhang Yang, Cuiyun Gao, Yun Peng, Hongyu Zhang, and Michael R Lyu. 2022. No more fine-tuning? an experimental evaluation of prompt tuning in code intelligence. In ESEC/FSE 2022. 382–394.
[74] Anjiang Wei, Yinlin Deng, Chenyuan Yang, and Lingming Zhang. 2022. Free lunch for testing: Fuzzing deep-learning libraries from open source. In ICSE 2022. 995–1007.
[75] Dominik Winterer, Chengyu Zhang, and Zhendong Su. 2020. On the unusual effectiveness of type-aware operator mutations for testing SMT solvers. Proc. ACM Program. Lang. 4, OOPSLA (2020), 193:1–193:25.
[76] Dominik Winterer, Chengyu Zhang, and Zhendong Su. 2020. Validating SMT Solvers via Semantic Fusion. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation. 718–730.
[77] Chunqiu Steven Xia and Lingming Zhang. 2023. Keep the Conversation Going: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT. arXiv preprint arXiv:2304.00385 (2023).
[78] Frank F. Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. A Systematic Evaluation of Large Language Models of Code (MAPS 2022). Association for Computing Machinery, New York, NY, USA, 1–10.
[79] Xuejun Yang, Yang Chen, Eric Eide, and John Regehr. 2011. Finding and understanding bugs in C compilers. In Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation. 283–294.
[80] Zhiqiang Yuan, Yiling Lou, Mingwei Liu, Shiji Ding, Kaixin Wang, Yixuan Chen, and Xin Peng. 2023. No More Manual Tests? Evaluating and Improving ChatGPT for Unit Test Generation. arXiv:2305.04207 [cs.SE]
[81] Shafiq Joty Yue Wang, Weishi Wang and Steven C.H. Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021.
[82] Andreas Zeller, Rahul Gopinath, Marcel Böhme, Gordon Fraser, and Christian Holler. 2019. The fuzzing book.
[83] Hui Zhao, Zhihui Li, Hansheng Wei, Jianqi Shi, and Yanhong Huang. 2019. SeqFuzzer: An Industrial Protocol Fuzzing Framework from a Deep Learning Perspective. In 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST). 59–67. https://doi.org/10.1109/ICST.2019.00016
[84] Yingquan Zhao, Zan Wang, Junjie Chen, Mengdi Liu, Mingyuan Wu, Yuqun Zhang, and Lingming Zhang. 2022. History-Driven Test Program Synthesis for JVM Testing. In Proceedings of the 44th International Conference on Software Engineering (Pittsburgh, Pennsylvania) (ICSE ’22). 1133–1144.
[85] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2022. Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910 (2022).
[86] Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-Tuning Language Models from Human Preferences. arXiv:1909.08593.

Journal

Computing Research Repository – arXiv (Cornell University)

Published: Aug 9, 2023
