How Language Model Hallucinations Can Snowball

Muru Zhang; Ofir Press; William Merrill; Alisa Liu; Noah A. Smith

doi:10.48550/arxiv.2305.13534

How Language Model Hallucinations Can Snowball

Zhang, Muru;Press, Ofir;Merrill, William;Liu, Alisa;Smith, Noah A. 2023-05-22 00:00:00 A major risk of using language models in practi- cal applications is their tendency to hallucinate incorrect statements. Hallucinations are often attributed to knowledge gaps in LMs, but we hypothesize that in some cases, when justifying previously generated hallucinations, LMs out- put false claims that they can separately recog- nize as incorrect. We construct three question- answering datasets where ChatGPT and GPT-4 often state an incorrect answer and offer an explanation with at least one incorrect claim. Crucially, we find that ChatGPT and GPT-4 can identify 67% and 87% of their own mistakes, re- Figure 1: GPT-4 mistakenly claims that 9677 is not spectively. We refer to this phenomenon as hal- prime, followed by an incorrect explanation that 13 × lucination snowballing: an LM over-commits 745 = 9677. We refer to this factorization as a snow- to early mistakes, leading to more mistakes that balled hallucination, as GPT-4 appears to “know” that it otherwise would not make. 13 is not a factor of 9677 when asked separately. Thus, hallucinations are not necessarily reflective of a model’s 1 Introduction lack of “knowledge,” but may be induced by an earlier Language models are increasingly being de- hallucination for consistency. ployed to interface with humans in open-ended information-seeking and problem-solving settings. Despite their diverse capabilities and extreme flu- To study this behavior empirically, we auto- ency, a major open challenge is that LMs still hal- matically construct three question-answering (QA) lucinate by making up facts or citing sources that datasets. These datasets span different domains: do not exist (Maynez et al., 2020; Liu et al., 2023, determining whether a number is prime, whether i.a.), often while sounding extremely plausible. there is a U.S. senator satisfying two given con- Hallucination is commonly attributed to knowl- straints, and whether two cities are connected given edge gaps in LMs (Zheng et al., 2023), motivating a set of flights between cities. Empirically, we find mitigation strategies through retrieval over knowl- that ChatGPT (OpenAI, 2022) and GPT-4 (OpenAI, edge bases (Lewis et al., 2020; Shuster et al., 2021; 2023) commit to an answer within the first token Peng et al., 2023) But, do LMs only hallucinate (Yes/No) over 95% of the time; these answers are when they do not “know” a fact? We present a often incorrect, and then followed by an incorrect setting where LMs often generate hallucinations explanation. Yet, when presented with the incor- rect explanation alone, we nd fi that the LM is likely that they immediately recognize as wrong when presented in isolation. Specifically, after an LM able to recognize it as incorrect. answers a question incorrectly, it usually justifies We refer to this phenomenon as hallucination that answer by making incorrect assertions that it snowballing. We hypothesize that LMs produce separately acknowledges as incorrect (Figure 1). snowballed hallucinations for consistency with ear- 1 lier hallucinations (rather than due to a “knowledge Our data and code are available at: https://github. com/Nanami18/Snowballed_Hallucination gap” in the model), as they recognize the snow- arXiv:2305.13534v1 [cs.CL] 22 May 2023 Dataset Original Question Verification Question User: Is 10733 divisible by 3? An- User: Is 10733 a prime number? swer with either Yes or No. Primality Æ GPT-4: No... It can be factored into 3 × 3577. Testing Æ GPT-4: No User: Was there ever a US senator that represented the state User: Was John P. Hale’s alma mater of New Hampshire and whose alma mater was the University of University of Pennsylvania? Senator Pennsylvania? Æ GPT-4: No... [it] was Bowdoin Search Æ GPT-4: Yes... His name was John P. Hale User: Current flight information (the following flights are User: [...flight information given in one-way only, and all the flights available are included below): the context...] There is a flight from city F to city K Based on the above flight information, There is a flight from city H to city A is City K to City G a valid flight? Graph [... 10 other rules cut for space ...] Æ GPT-4: No, based on the above Connectivity Question: Is there a series of flights that goes from city B to city flight information, there is no direct E? flight from City K to City G. Æ GPT-4: Yes... the route is as follows: ... City K to City G... Table 1: Original Question: An example from each dataset and GPT-4’s (incorrect) response, with the snowballed hallucination underlined. Verification Question : The corresponding query used to probe whether GPT-4 recognizes its claim is incorrect, with its refutation of the hallucination underlined. balled hallucination is incorrect when presented in with two key properties: isolation (i.e., in a separate interaction session). 1. Initial committal: The prompt leads the LM While prompting strategies that encourage the to first state an answer ( before outputting the LM to reason before stating an answer improve explanation). This applies to many yes/no accuracy on the task, our work points to the broader questions. issue that conditioning on faulty context leads LMs 2. Inherently sequential: Transformers cannot to produce extremely simple mistakes that they find the answer within one timestep because wouldn’t otherwise make. Indeed, when prompting of their limited reasoning abilities within one with “Let’s think step by step” (Kojima et al., 2023), timestep. snowballed hallucinations still occur in 95% of We now discuss how these properties may lead to cases where the model fails to answer correctly. snowballed hallucination. We observe that sometimes even when “Let’s think step by step” does lead to the right answer, it uses Initial committal. In English and many other invalid reasoning chains. languages, speakers often say the final Yes/No an- In this paper, we demonstrate the phenomenon swers to questions before explaining their answer. We therefore hypothesize that LMs and especially of hallucination snowballing by leveraging recent LMs’ tendency to state and justify their answers. instruction-tuned LMs (Wei et al., 2021; Sanh et al., 2021; Ouyang et al., 2022; Wang et al., 2022) will Rather than over-committing to its previously gen- reflect this answer format where the answer comes erated context, we believe that LMs should ac- knowledge their initial mistake, and then revise before the explanation. Indeed, on our datasets (presented in §3.1), we observe that GPT-4 and their answer. We have indeed observed GPT-4 do- ing this in a limited number of cases; amplifying ChatGPT immediately commit to an answer to the question: the first token is Yes or No 95.67% and this behavior would be beneficial, as well as devel- oping new methods in which LMs can backtrack. 98.40% of the time for GPT-4 and ChatGPT respec- tively. In the remaining cases, the model often com- mits to an answer within the first few tokens of the 2 Why do we expect hallucination response (e.g., “There is no record of a U.S. Sena- snowballing? tor...”). Crucially, once the LM generates Yes or No, In this section, we explain why we hypothesize that that token remains in the context, and coherence LMs are susceptible to hallucination snowballing. would require commitment to that choice through We predict that snowballing will occur on questions the subsequent justification. Thus, the model pro- duces an answer to a complex question in a single For each dataset, we fix one specific label for timestep, and it then continues by generating an all examples, so that if the model chooses the in- explanation for that answer, which inevitably will correct answer (e.g., that 9677 is not prime), it be incorrect. would produce a specific claim to support it (e.g., an incorrect factorization). This enables us to sys- Inherently sequential. Furthermore, transform- tematically examine model-written justifications ers cannot solve inherently sequential reasoning for incorrect answers. problems like primality testing or graph connec- tivity within a single timestep, as documented Primality testing For this dataset, we query in recent theoretical results (Merrill and Sabhar- the primality of 500 randomly chosen primes be- wal, 2023). Our graph connectivity and primality tween 1,000 and 20,000; the correct answer is al- datasets are concrete instantiations of these prob- ways Yes. When the model answers incorrectly, lems. Because the transformer must use one step to we expect it to justify its answer with an incorrect answer a question that requires multiple timesteps factorization. to answer correctly, it will necessarily sometimes Senator search This dataset consists of 500 commit to an incorrect answer. We hypothesize questions of the form “Was there ever a US sena- that this leads the LM to hallucinate supporting tor that represented the state of x and whose alma incorrect facts that it otherwise would not generate. mater was y?” where x is a U.S. state and y is 3 Experiments a U.S. college. For these questions, the correct answer is always No. When the model answers We design three QA datasets with the properties incorrectly, we expect it to falsely claim that a par- described in §2 to probe hallucination snowballing, ticular senator both represented x and attended y. and evaluate ChatGPT and GPT-4. We first check To create the dataset we consider all U.S. states whether the LM returns the correct answer to the and a manually constructed list of twelve popular given question, and we show that when the model U.S. colleges (see §A for the full list); for each returns the wrong answer, it frequently provides an possible pair, we generate a question following the incorrect explanation for that wrong answer. We template, and manually remove pairs where the automatically extract the incorrect claim in the ex- answer is Yes. planation and ask the same LM to check whether its claim is correct. See Table 1 for a representative Graph connectivity For each of the 500 ques- example from each dataset. tions in this dataset, we present 12 flights among 14 cities, and ask if there is a sequence of flights from 3.1 Datasets a particular city to another. The problem always We design three QA datasets, each containing 500 corresponds to the same underlying directed graph yes/no questions that we expect are not answerable structure (see §A.1), where flights are edges and by transformers in one timestep. To aid evalua- cities are nodes. For each instance in the dataset, tion, the questions are designed so that an incorrect we randomly assign letters from the English alpha- answer would be justified with easily verifiable bet to name the nodes. To formulate the query, we claims. sample a source city s and destination city t in dif- ferent subgraphs, with the additional constraint that Technically, this holds only for inputs above a certain hardness level, i.e., the size of the prime number for primality s corresponds to a source node, and t a leaf node, testing, or the size of the graph for graph connectivity. so that 1-step heuristics cannot be used to solve the Merrill and Sabharwal (2023) show that, with a single problem. generation step, bounded-precision transformers cannot solve any problem outside the complexity class TC , which corre- We formulate the problem as a flight-finding sponds to a highly parallelizable subclass of both L (log-space) question in natural language so that it sounds more and P (polynomial-time). Graph connectivity is an L-complete 0 0 natural: in the prompt, we list the twelve flights problem, which means it cannot be in TC unless TC = L, i.e., all of L can be parallelized to a surprisingly high degree. (“There is a flight from city F to city K; there is a Primality testing was shown to be in P (Agrawal et al., 2004) flight from city G to city N, ...”), followed by the but cannot be in TC unless it is also in L; i.e., any n can be question “Is there a series of flights... from s to t?”. factored with O(log log n) bits of overhead. In summary, un- less standard complexity-theoretic conjectures are false, graph Note the correct answer is always No. When the connectivity and primality testing are outside TC and thus model answers incorrectly, we expect it to justify are too inherentially sequential for transformers to solve in a single generation (cf. Merrill and Sabharwal, 2023). its answer with a flight that does not exist. 0.04 0.02 0.00 0.02 Wrong answer (Hallucination) 0.04 Model knows it's wrong (Snowballed Hallucination) Graph Senator Primality 100 100 Connectivity Search Testing 80 80 60 60 40 40 20 20 0 0 Primality Senator Graph Primality Senator Graph Testing Search Connectivity Testing Search Connectivity (a) ChatGPT (b) GPT-4 Figure 2: Percentage of hallucination and percentage of snowballed hallucination (both calculated with respect to the entire dataset) for ChatGPT and GPT-4. The precise numbers for this plot are available in Table 6 and Table 7 in the Appendix. 3.2 Inference Setup models (e.g., a different model might not provide factors when supporting the claim that a number is Language models. We run all experiments not prime). on ChatGPT (gpt-3.5-turbo) and GPT-4 with greedy decoding. Graph Connectivity For each sample where the Our experiments are zero-shot (i.e., we do not model thinks there is a series of connecting flights show the model any example QA pairs in the (where answer starts with Yes), we manually ex- prompt). We focus on the model behavior under tract the list of flights from the model’s output and the direct prompt (see §A for full examples), which identify the invalid or discontinuous flights. is the most common way users interact with LMs. We then, in a new session, ask the model to ver- See §4 for experiments with the zero-shot chain-of- ify whether the extracted flights are valid based on thought style prompting method. the flight information, and if consecutive flights are For each dataset, we perform a two-stage evalua- indeed connected. We manually assess the verifica- tion. First, we evaluate the model’s accuracy (i.e., tion output to check if the model correctly detects how many of the questions it answers correctly). the error. See Appendix Table 3 for how we prompt When either models is incorrect, empirically it al- the model and an example of successful verifica- ways generates a justification. In the second stage, tion. we assess whether the model can identify the incor- rect step in the explanation. Primality Testing For each sample where the For a given question, we evaluate the model’s model answers that the number is not prime, we response by examining whether the output begins extract the factors the model uses to justify it. The with either Yes or No. In cases where the response extraction is done by putting the output in the con- does not fall into these categories, we manually text and asking “What are the factors proposed determine the answer conveyed by the model. in the above text? List them out.” We use Chat- 3.3 LM Recognition of Snowballed GPT for extraction with one-shot demonstration Hallucinations (for its fast inference speed); we manually checked 30 examples and found that it can always extract We probe whether LMs recognize their snowballed the correct factors. hallucinations by verifying the model’s incorrect claims in the output against the model itself. Note We then, in a new session, ask the model to that our recognition procedure relies on heuristics verify each extracted factor individually. See Ap- gained from manual examination of the model out- pendix Table 4 for an example of successful verifi- put, and these heuristics might not work on other cation. Proportion of Dataset (%) Proportion of Dataset (%) 0.04 0.02 0.00 0.02 0.04 Original prompt Step-by-step prompt Primality Senator Graph 100 100 Testing Search Connectivity dataset 80 80 60 60 40 40 20 20 0 0 Primality Senator Graph Primality Senator Graph Testing Search Connectivity Testing Search Connectivity (a) ChatGPT (b) GPT-4 Figure 3: Error rate and snowballed hallucination rate (hatch pattern) for ChatGPT and GPT-4, when using the original prompt versus “Let’s think step by step”. See Appendix Table 8 and Table 9 for the exact numbers. Senator Search For each sample where the example in Appendix Table 12), we do not model thinks there is such senator, we extract the consider it a snowballed hallucination. name of the senator the model uses to justify the Overall, we find that ChatGPT and GPT-4 are existence, by putting the output in the context and both extremely susceptible to hallucination snow- asking “What is the senator mentioned in the above balling, leading to extremely simple mistakes. text? Just give the name”. Again, we use ChatGPT 4 Can we prevent snowball and manually observed perfect extraction on 30 hallucinations? examples. We then, in a new session, ask the model if that We hypothesize that hallucination snowballing oc- senator’s alma mater is the college in the question curs because LMs are trained to model continu- and has represented the state in the question. See ations consistent with their current context (the Appendix Table 5 for an example of successful given prompt and prior outputs). Although a fix to detection. the fundamental problem might require more than just inference-time modification, in this section we 3.4 Results study the effectiveness of two inference strategies Question-answering accuracy Figure 2 shows in alleviating hallucination snowballing: prompting that both ChatGPT and GPT-4 experience very low (§4.1) and decoding or training methods (§4.2). accuracy across the board. With the exception of ChatGPT on the Senator Search dataset, all 4.1 Engineering Better Prompts models achieve less than 50% accuracy.(See Ap- In this section, we examine the effectiveness of pendix Table 6 for a breakdown of the error rate by better prompts on preventing snowballed halluci- dataset.) We observe that GPT-4 performs worse nation by using a different zero-shot prompt that than ChatGPT across all datasets despite popularly encourages the model to generate the reasoning being considered superior to ChatGPT (OpenAI, chain before the answer. Since the outputs gener- 2023). While ChatGPT has an average accuracy of ated under these prompts are less structured, we 39.87%, GPT-4 has only 16.6%. manually inspect them to determine correctness Hallucination detection Here, we check and the presence of snowballed hallucinations. whether the model can identify that the incorrect For each task, we append “Let’s think step-by- claim is wrong when it is presented alone. As step” at the end of the original question (shown in shown in Figure 2, ChatGPT detects 67.37% of Table 1). As shown in Figure 3, the model can solve incorrect claims in explanations (i.e., snowballed the Senator Search task perfectly, achieve ≤ 10% hallucinations), and GPT-4 detects 87.03%. Notice error rate on Primality Testing, and ≤ 30% on that when the model fails the verification (an Graph Connectivity. Despite the large improve- Proportion of Dataset (%) error_rate Proportion of Dataset (%) 0.04 0.02 0.00 0.02 Temperature 0.04 0.0 0.6 0.9 Primality Senator Graph 100 100 Testing Search Connectivity dataset 80 80 60 60 40 40 20 20 0 0 Primality Senator Graph Primality Senator Graph Testing Search Connectivity Testing Search Connectivity (a) ChatGPT (b) GPT-4 Figure 4: Error rate and snowballed hallucination rate (hatch pattern) from ChatGPT and GPT-4, when using different values for temperature at decoding-time. See Appendix Table 10 and Table 11 for the exact numbers. ment in accuracy, we identify a potential issue: the 4.2 Algorithmic Corrections model sometimes hallucinate while outputting the Increasing the temperature During decoding, reasoning chain, which causes snowballed halluci- the temperature t controls the sharpness of the out- nation in future steps. For example, in the below put distribution, with higher t spreading probability output, mass away from the model’s most likely prediction for each next word. Our experiments in §3 used [....previous steps omitted] greedy decoding, which is equivalent to t = 0. At t = 0.6 and t = 0.9, both error rates and snow- Step 3: From city E, we have three op- balled hallucination rate remain similarly high, in tions: a flight to city N, a flight to city B, both GPT-4 and ChatGPT (Figure 4). or a flight to city C. Top-k and nucleus sampling Using sampling Step 4: The only option that could poten- methods such as top-k sampling or nucleus sam- tially lead us to city M is the flight from pling (Holtzman et al., 2020) would not help since city E to city C. they only narrow the range of tokens to be consid- [....rest of the output omitted] ered, and thus can only increase the probability that the model will immediately commit to an answer. ChatGPT incorrectly states that there are three op- tions in the step 3 (there are only two), inducing Beam search The argument for hallucination the snowballed hallucination “or a flight to city snowballs in §2 relies on the fact that, once a model C” (ChatGPT can verify that E → C is not a valid generates some tokens committing to an answer, flight in a separate session). As shown in Figure 3, they remain in the context and influence later gen- GPT-4 still has a high overall snowballed halluci- erations. One potential way around this is beam nation rate at 94.90% averaged across tasks, and search, i.e., maintaining a beam of high-probability ChatGPT also obtains a similarly high snowballed sequences at each timestep rather than a single se- hallucination rate. quence. In principle, if some sequences in the beam Finally, while our experiments have focused on after the initial token do not commit to an answer simple multi-step problems that are suitable for (or commit to the right answer), their continuations breaking down step-by-step, we hypothesize that may eventually have higher probability than those hallucination snowballing appears in open-ended that initially commit incorrectly and later produce text generation more broadly, where one mistake incorrect reasoning as a result. If so, beam search in the generation triggers more (Arora et al., 2022). would solve the snowball hallucination problem. In these cases, better prompting would neither be Unfortunately, we cannot test the effect of beam able to anticipate nor fix these mistakes. search on hallucination snowballs because the Ope- Proportion of Dataset (%) error_rate Proportion of Dataset (%) nAI API does not support beam search. context misguides the LM. However, our work differs in that our questions are not intentionally Learning strategies A more general way to fur- misleading, showing that this failure mode may ther reduce snowballing might be to change aspects be triggered even on innocent information-seeking of the pretraining or instruction tuning phases. In queries to the LM. particular, a greater emphasis on having the model LM (in)consistency Our work adds to a growing produce a reasoning chain before generating an body of work demonstrating the extent to which answer could be a good way to accommodate its LMs are inconsistent across different prompts on computational limitations and avoid committing to the same issue. For instance, allowing an LM to wrong answers that force hallucinations. generate intermediate steps (Nye et al., 2021; Wei In addition, we hypothesize that finetuning on data with backtracking might improve a model’s et al., 2022; Press et al., 2022) enables it to reach a different answer than it otherwise would. Other performance on the tasks we present. This could be work has shown that simply prepending “Profes- accomplished by, for example, giving a question, sor Smith was given the following instructions” to followed by a wrong solution, and then issuing a a prompt can improve performance, despite pro- phrase like “Sorry, that was incorrect” before giv- viding no valuable information about the problem ing the correct solution. This solution is related to itself (Lin et al., 2022). the “Review your previous answer and find prob- lems with your answer.” prompt from Kim et al. 6 Conclusion (2023). We define the phenomenon of hallucination snow- 5 Related Work balling and demonstrate its prevalence in genera- tions from state-of-the-art models, leading to hallu- Hallucinations Hallucination in text generation cinations on simple facts that wouldn’t otherwise is a well-studied problem (Rohrbach et al., 2018; occur. Our findings point to the risk of training Maynez et al., 2020; Raunak et al., 2021, i.a.) that language models that prioritize uenc fl y and coher- has recently become more prominent due to Chat- ence indiscriminatively at the expense of factuality, GPT’s tendency to produce plausible-sounding and we encourage future work to study remedial falsehoods. Hallucinations are often attributed to actions at all levels of model development. knowledge gaps in LMs (Zheng et al., 2023), and several works have shown the promise of using Limitations retrieval over knowledge bases to mitigate them (Lewis et al., 2020; Shuster et al., 2021; Peng et al., We focus on hallucination snowballing in the con- 2023). Our work demonstrates hallucination can text of question answering in English, and we do be induced from context, thus motivating further not explore it on other tasks, such as summarization mitigation techniques. or code generation. Hallucination snowballing is likely the result of In addition, we only conduct experiments on two exposure bias: LMs were only exposed to gold proprietary models, namely ChatGPT and GPT-4, history during training, but during inference, con- due to their state-of-the-art performance on many ditions on possibly erroneous previous predictions. benchmarks (OpenAI, 2023). Due to the limita- Prior work linked this to compounding hallucina- tions of the APIs for these models, we do not have tions in machine translation (Wang and Sennrich, access to the probability distributions they output 2020) and open-ended text generation (Arora et al., and do not have the ability to finetune them. This 2022). We go beyond demonstrating error propaga- restricts our ability to explore potential mitigation tion by showing that the propagated errors (which strategies. Having access to the output distribu- we call snowballed hallucinations) are recognized tions would allow us to investigate mitigating the by the LM itself. snowballing hallucination issue using alternative Our observations are related to previous findings sampling methods such as beam search. Having that LMs hallucinate when given questions that the ability to finetune the model would allow us contain false presuppositions (e.g., “Which linguist to explore whether instruction tuning with differ- invented the lightbulb?”; Kim et al., 2021, 2022) ent annotations could lead to better handling of the or that are otherwise misleading (e.g., “Who re- questions we use to instigate hallucination snow- ally caused 9/11?”; Lin et al., 2022), in that faulty balling. Acknowledgements Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factu- We thank Sofia Serrano, Yizhong Wang, Yanai ality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Elazar, Michael Hu and Richard Yuanzhe Pang Computational Linguistics, pages 1906–1919, On- for their valuable feedback and fruitful discussions. line. Association for Computational Linguistics. While writing this paper, Ofir Press was a visitor at New York University’s Center for Data Science, William Merrill and Ashish Sabharwal. 2023. The par- allelism tradeoff: Limitations of log-precision trans- hosted by Kyunghyun Cho. formers. Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, References Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, Manindra Agrawal, Neeraj Kayal, and Nitin Saxena. David Luan, Charles Sutton, and Augustus Odena. 2004. Primes is in p. Annals of Mathematics, 2021. Show your work: Scratchpads for intermediate 160:781–793. Godel Prize, Fulkerson Prize. computation with language models. Kushal Arora, Layla El Asri, Hareesh Bahuleyan, and OpenAI. 2022. Introducing chatgpt. Jackie Cheung. 2022. Why exposure bias matters: An imitation learning perspective of error accumu- OpenAI. 2023. Gpt-4 technical report. lation in language generation. In Findings of the Association for Computational Linguistics: ACL Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Car- 2022, pages 700–710, Dublin, Ireland. Association roll L. Wainwright, Pamela Mishkin, Chong Zhang, for Computational Linguistics. Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Maddie Simens, Amanda Askell, Peter Welinder, Yejin Choi. 2020. The curious case of neural text de- Paul Christiano, Jan Leike, and Ryan Lowe. 2022. generation. In International Conference on Learning Training language models to follow instructions with Representations. human feedback. Geunwoo Kim, Pierre Baldi, and Stephen McAleer. Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, 2023. Language models can solve computer tasks. Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, and Jianfeng Gao. 2023. Check Najoung Kim, Phu Mon Htut, Sam Bowman, and Jack- your facts and try again: Improving large language son Petty. 2022. (qa)2: Question answering with models with external knowledge and automated feed- questionable assumptions. ArXiv, abs/2212.10003. back. Najoung Kim, Ellie Pavlick, Burcu Karagol Ayan, and Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Deepak Ramachandran. 2021. Which linguist in- Noah A. Smith, and Mike Lewis. 2022. Measuring vented the lightbulb? presupposition verification for and narrowing the compositionality gap in language question-answering. In Annual Meeting of the Asso- models. ciation for Computational Linguistics. Vikas Raunak, Arul Menezes, and Marcin Junczys- Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yu- Dowmunt. 2021. The curious case of hallucinations taka Matsuo, and Yusuke Iwasawa. 2023. Large lan- in neural machine translation. In Proceedings of guage models are zero-shot reasoners. the 2021 Conference of the North American Chap- Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio ter of the Association for Computational Linguistics: Petroni, Vladimir Karpukhin, Naman Goyal, Hein- Human Language Technologies, pages 1172–1183, rich Küttler, Mike Lewis, Wen-tau Yih, Tim Rock- Online. Association for Computational Linguistics. täschel, Sebastian Riedel, and Douwe Kiela. 2020. Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Retrieval-augmented generation for knowledge- Trevor Darrell, and Kate Saenko. 2018. Object hallu- intensive nlp tasks. In Proceedings of the 34th Inter- cination in image captioning. In Proceedings of the national Conference on Neural Information Process- 2018 Conference on Empirical Methods in Natural ing Systems, NIPS’20, Red Hook, NY, USA. Curran Language Processing, pages 4035–4045, Brussels, Associates Inc. Belgium. Association for Computational Linguistics. Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. Victor Sanh, Albert Webson, Colin Raffel, Stephen H. TruthfulQA: Measuring how models mimic human Bach, Lintang Sutawika, Zaid Alyafeai, Antoine falsehoods. In Proceedings of the 60th Annual Meet- Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, ing of the Association for Computational Linguistics Manan Dey, M Saiful Bari, Canwen Xu, Urmish (Volume 1: Long Papers), pages 3214–3252, Dublin, Thakker, Shanya Sharma Sharma, Eliza Szczechla, Ireland. Association for Computational Linguistics. Taewoon Kim, Gunjan Chhablani, Nihal Nayak, De- Nelson F. Liu, Tianyi Zhang, and Percy Liang. 2023. bajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Evaluating verifiability in generative search engines. Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas A Dataset Details Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, A.1 Graph Connectivity Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Tali Bers, Stella Biderman, Leo Gao, In this dataset, the list of flights can be represented Thomas Wolf, and Alexander M. Rush. 2021. Multi- by a directed graph. We generated the flight in- task prompted training enables zero-shot task gener- alization. formation to ensure all the graphs share a specific connection pattern, with the node names randomly Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, chosen among the 26 letters in the English alphabet. and Jason Weston. 2021. Retrieval augmentation For an illustration of the underlying graph structure, reduces hallucination in conversation. In Findings of the Association for Computational Linguistics: see Figure 5. EMNLP 2021, pages 3784–3803, Punta Cana, Do- minican Republic. Association for Computational A.2 Senator search Linguistics. The twelve colleges used in the datasets are: MIT, Chaojun Wang and Rico Sennrich. 2020. On exposure University of Chicago, Johns Hopkins University, bias, hallucination and domain shift in neural ma- California Institute of Technology, Duke Univer- chine translation. In Proceedings of the 58th Annual sity, Northwestern University, Dartmouth College, Meeting of the Association for Computational Lin- Brown University, Vanderbilt University, Rice Uni- guistics, pages 3544–3552, Online. Association for Computational Linguistics. versity, University of Washington. We constructed this list by taking a list of top universities in the Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa U.S. and excluding from it universities which also Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-instruct: Aligning language appeared on The U.S. News & World Report’s list model with self generated instructions. of Top 10 Colleges for Members of Congress. Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin B Additional Results Guu, Adams Wei Yu, Brian Lester, Nan Du, An- drew M. Dai, and Quoc V. Le. 2021. Finetuned We provide the detail breakdown of the question- language models are zero-shot learners. answering accuracy in Table 6 and the hallucination Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten detection accuracy in Table 7. Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. 2022. Chain of thought prompt- ing elicits reasoning in large language models. In Advances in Neural Information Processing Systems. Shen Zheng, Jie Huang, and Kevin Chen-Chuan Chang. 2023. Why does chatgpt fall short in answering ques- tions faithfully? Graph Connectivity: Current flight information (the following flights are one-way only, and all the flights available are included below): There is a flight from city F to city K There is a flight from city H to city A There is a flight from city B to city F There is a flight from city N to city H There is a flight from city B to city M There is a flight from city N to city G There is a flight from city M to city J There is a flight from city G to city L There is a flight from city H to city E There is a flight from city G to city C There is a flight from city M to city I There is a flight from city F to city D Question: Is there a series of flights that goes from city B to city E? Table 2: Input example for the Graph Connectivity task. Figure 5 describes the underlying graph structure for this example. We use the same graph structure in all examples, and vary the source and target cities and the labels of the nodes. The cities are literally called “City A”, “City B”, and so on. B N F M H G K D J I A E L C Figure 5: The underlying graph structure corresponds to example in Table 2. Input: Current flight information (the following flights are one-way only, and all the flights available are included below): There is a flight from city G to city B There is a flight from city H to city K There is a flight from city L to city M There is a flight from city F to city H There is a flight from city G to city J There is a flight from city B to city I There is a flight from city L to city A There is a flight from city H to city N There is a flight from city B to city D There is a flight from city J to city C Question: Is there a series of flights that goes from city F to city I? Output: Yes, there is a series of flights that goes from city F to city I. The route is as follows: 1. City F to City H 2. City H to City K 3. City K to City G 4. City G to City B 5. City B to City I Verification Question : Current flight information (the following flights are one-way only, and all the flights available are included below): There is a flight from city G to city B There is a flight from city H to city K [... 10 other rules cut for space ...] Based on the above flight information, is City K to City G a valid flight? Verification Answer : No, based on the above flight information, there is no direct flight from City K to City G. Table 3: Graph Connectivity dataset verification example, with GPT-4. Input: Is 9791 a prime number? Output: No, 9791 is not a prime number. It can be factored into 13 × 751. Verification Q1 : Is 9791 divisible by 13? Answer with either Yes or No. Verification A1 : No Verification Q2 : Is 9791 divisible by 751? Answer with either Yes or No. Verification A2 : No Table 4: Primality Testing dataset verification example, with GPT-4. Input: Was there ever a US senator that represented the state of New Hampshire and whose alma mater was University of Pennsylvania? Output: Yes, there was a U.S. Senator who represented the state of New Hampshire and whose alma mater was the University of Pennsylvania. His name is John P. Hale, who served as a Senator from 1847 to 1853 and again from 1855 to 1865. He graduated from Bowdoin College in 1827 and later attended the University of Pennsylvania Law School. Verification Q1 : Has John P. Hale represented the state of New Hampshire as US senator? Verification A1 : Yes, John P. Hale represented the state of New Hampshire as a U.S. Senator. He served in the Senate from March 4, 1847, to March 3, 1853, and again from March 4, 1855, to March 3, Verification Q2 : Was John P. Hale’s alma mater University of Pennsylvania? Verification A2 : No, John P. Hale’s alma mater was Bowdoin College in Brunswick, Maine. Table 5: Senator Alma Mater dataset verification example, with GPT-4. Task Average Model Graph Connectivity Primality Testing Senator Search ChatGPT 410/500 (82.0%) 339/500 (67.8%) 153/500 (30.6%) 60.13% GPT-4 442/500 (88.4%) 374/500 (74.8%) 435/500 (87.0%) 83.40% Table 6: Number of mistakes out of the number of samples, the percentage here is the error rate. Task Average Model Graph Connectivity Primality Testing Senator Search ChatGPT 396/410 (96.6%) 125/339 (36.9%) 98/153 (68.6%) 67.37% GPT-4 417/442 (94.3%) 346/374 (92.5%) 323/435 (74.3%) 87.03% Table 7: Number of snowballed hallucination out of number of hallucination generated in the original output. Task Average Model Graph Connectivity Primality Testing Senator Search ChatGPT 139/500 (27.8%) 2/500 (0.4%) 0/500 (0.0%) 9.40% GPT-4 21/500 (4.2%) 37/500 (7.4%) 0/500 (0.0%) 3.87% Table 8: Number of mistakes out of the number of samples, the percentage here is the error rate, using “Let’s think step by step” prompt. Task Average Model Graph Connectivity Primality Testing Senator Search ChatGPT 123/139 (88.5%) 0/2 (0%) 0/0 (N/A) 44.25% GPT-4 20/21 (95.2%) 35/37 (94.6%) 0/0 (N/A) 94.90% Table 9: Number of snowballed hallucination out of number of hallucination generated in the original output, using “Let’s think step by step” prompt. Model Graph Prime Senator Average ChatGPT (t = 0.0) 410/500 (82.0%) 339/500 (67.8%) 153/500 (30.6%) 60.13% ChatGPT (t = 0.6) 407/500 (81.4%) 310/500 (63.2%) 155/500 (31.0%) 58.53% ChatGPT (t = 0.9) 403/500 (80.6%) 312/500 (62.4%) 163/500 (32.6%) 58.53% GPT-4 (t = 0.0) 442/500 (88.4%) 374/500 (74.8%) 435/500 (87.0%) 83.40% GPT-4 (t = 0.6) 438/500 (87.6%) 365/500 (75.4%) 423/500 (84.6%) 82.53% GPT-4 (t = 0.9) 437/500 (87.4%) 377/500 (73.0%) 423/500 (84.6%) 81.67% Table 10: Number of mistakes out of the number of samples, the percentage here is the error rate, with different temperature setups Model Graph Prime Senator Average ChatGPT (t = 0.0) 396/410 (96.6%) 125/339 (36.9%) 98/153 (68.6%) 67.37% ChatGPT (t = 0.6) 396/407 (97.3%) 113/310 (36.5%) 103/155 (66.5%) 66.77% ChatGPT (t = 0.9) 399/402 (99.3)% 116/312 (37.2%) 104/163 (63.8%) 66.77% GPT-4 (t = 0.0) 417/442 (94.3%) 346/374 (92.5%) 323/435 (74.3%) 87.03% GPT-4 (t = 0.6) 402/438 (91.8%) 324/365 (88.8%) 329/423 (77.8%) 86.13% GPT-4 (t = 0.9) 398/437 (91.1%) 331/377 (87.8%) 320/423 (75.7%) 84.87% Table 11: Number of mistakes out of the number of samples, the percentage here is the error rate, with different temperature setups Input: Was there ever a US senator that represented the state of North Carolina and whose alma mater was Dartmouth College? Output: Yes, there was a U.S. Senator who represented the state of North Carolina and whose alma mater was Dartmouth College. His name was Willis Smith, and he served as a Senator from 1950 to Verification Q1 : Has Willis Smith represented the state of North Carolina as a US senator? Verification A1 : Yes, Willis Smith represented the state of North Carolina as a U.S. Senator. He served from 1950 until his death in 1953. Verification Q2 : Was Willis Smith’s alma mater Dartmouth College? Verification A2 : Yes, Willis Smith’s alma mater was Dartmouth College. He graduated from there in Table 12: An example of GPT-4 failing the verification test for Senator Search task. Willis Smith was a senator from North Carolina, but he has never attended Dartmouth College. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Computing Research Repository arXiv (Cornell University) http://www.deepdyve.com/lp/arxiv-cornell-university/how-language-model-hallucinations-can-snowball-HmxMblj1Io

Loading next page...

References (31)

G. Hirst (2001)
Association for Computational Linguistics
Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, J. Weston (2021)
Retrieval Augmentation Reduces Hallucination in Conversation
Takeshi Kojima, S. Gu, Machel Reid, Yutaka Matsuo, Yusuke Iwasawa (2022)
Large Language Models are Zero-Shot Reasoners
ArXiv, abs/2205.11916
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, Yejin Choi (2019)
The Curious Case of Neural Text Degeneration
ArXiv, abs/1904.09751
Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Scao, Arun Raja, Manan Dey, M Bari, Canwen Xu, Urmish Thakker, S. Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng-Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Févry, Jason Fries, Ryan Teehan, Stella Biderman, Leo Gao, T. Bers, Thomas Wolf, Alexander Rush (2021)
Multitask Prompted Training Enables Zero-Shot Task Generalization
ArXiv, abs/2110.08207
William Merrill, Ashish Sabharwal (2022)
The Parallelism Tradeoff: Limitations of Log-Precision Transformers
Transactions of the Association for Computational Linguistics, 11
(1917)
Was Willis Smith's alma mater Dartmouth College? Verification A2: Yes, Willis Smith's alma mater was Dartmouth College
Vikas Raunak, Arul Menezes, Marcin Junczys-Dowmunt (2021)
The Curious Case of Hallucinations in Neural Machine Translation
ArXiv, abs/2104.06683
Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah Smith, M. Lewis (2022)
Measuring and Narrowing the Compositionality Gap in Language Models
ArXiv, abs/2210.03350
Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Lidén, Zhou Yu, Weizhu Chen, Jianfeng Gao (2023)
Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback
ArXiv, abs/2302.12813
Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Yu, Brian Lester, Nan Du, Andrew Dai, Quoc Le (2021)
Finetuned Language Models Are Zero-Shot Learners
ArXiv, abs/2109.01652
this list by taking a list of top universities in the U.S. and excluding from it universities which also appeared
Kushal Arora, Layla Asri, Hareesh Bahuleyan, J. Cheung (2022)
Why Exposure Bias Matters: An Imitation Learning Perspective of Error Accumulation in Language Generation
Anna Rohrbach, Lisa Hendricks, Kaylee Burns, Trevor Darrell, Kate Saenko (2018)
Object Hallucination in Image Captioning
Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Kuttler, M. Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela (2020)
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
ArXiv, abs/2005.11401
Najoung Kim, Ellie Pavlick, Burcu Ayan, Deepak Ramachandran (2021)
Which Linguist Invented the Lightbulb? Presupposition Verification for Question-Answering
Stephanie Lin, Jacob Hilton, Owain Evans (2021)
TruthfulQA: Measuring How Models Mimic Human Falsehoods
Additional Results We provide the detail breakdown of the question-answering accuracy in
Najoung Kim, Phu Htut, Sam Bowman, Jackson Petty (2022)
(QA)^2: Question Answering with Questionable Assumptions
ArXiv, abs/2212.10003
Chaojun Wang, Rico Sennrich (2020)
On Exposure Bias, Hallucination and Domain Shift in Neural Machine Translation
ArXiv, abs/2005.03642
An example of GPT-4 failing the verification test for Senator Search task. Willis Smith was a senator from North Carolina, but he has never attended Dartmouth College
Nelson Liu, Tianyi Zhang, Percy Liang (2023)
Evaluating Verifiability in Generative Search Engines
ArXiv, abs/2304.09848
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah Smith, Daniel Khashabi, Hannaneh Hajishirzi (2022)
Self-Instruct: Aligning Language Models with Self-Generated Instructions
Joshua Maynez, Shashi Narayan, Bernd Bohnet, Ryan McDonald (2020)
On Faithfulness and Factuality in Abstractive Summarization
ArXiv, abs/2005.00661
Geunwoo Kim, P. Baldi, S. McAleer (2023)
Language Models can Solve Computer Tasks
ArXiv, abs/2303.17491
Manindra Agrawal, N. Kayal, Nitin Saxena (2004)
PRIMES is in P
Annals of Mathematics, 160
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, P. Welinder, P. Christiano, J. Leike, Ryan Lowe (2022)
Training language models to follow instructions with human feedback
ArXiv, abs/2203.02155
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, E. Chi, F. Xia, Quoc Le, Denny Zhou (2022)
Chain of Thought Prompting Elicits Reasoning in Large Language Models
ArXiv, abs/2201.11903
Maxwell Nye, Anders Andreassen, Guy Gur-Ari, H. Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, D. Luan, Charles Sutton, Augustus Odena (2021)
Show Your Work: Scratchpads for Intermediate Computation with Language Models
ArXiv, abs/2112.00114
Shen Zheng, Jie Huang, K. Chang (2023)
Why Does ChatGPT Fall Short in Answering Questions Faithfully?
ArXiv, abs/2304.10513
Number of mistakes out of the number of samples, the percentage here is the error rate

eISSN: ARCH-3344
DOI: 10.48550/arxiv.2305.13534
Publisher site: See Article on Publisher Site

Abstract

A major risk of using language models in practi- cal applications is their tendency to hallucinate incorrect statements. Hallucinations are often attributed to knowledge gaps in LMs, but we hypothesize that in some cases, when justifying previously generated hallucinations, LMs out- put false claims that they can separately recog- nize as incorrect. We construct three question- answering datasets where ChatGPT and GPT-4 often state an incorrect answer and offer an explanation with at least one incorrect claim. Crucially, we find that ChatGPT and GPT-4 can identify 67% and 87% of their own mistakes, re- Figure 1: GPT-4 mistakenly claims that 9677 is not spectively. We refer to this phenomenon as hal- prime, followed by an incorrect explanation that 13 × lucination snowballing: an LM over-commits 745 = 9677. We refer to this factorization as a snow- to early mistakes, leading to more mistakes that balled hallucination, as GPT-4 appears to “know” that it otherwise would not make. 13 is not a factor of 9677 when asked separately. Thus, hallucinations are not necessarily reflective of a model’s 1 Introduction lack of “knowledge,” but may be induced by an earlier Language models are increasingly being de- hallucination for consistency. ployed to interface with humans in open-ended information-seeking and problem-solving settings. Despite their diverse capabilities and extreme flu- To study this behavior empirically, we auto- ency, a major open challenge is that LMs still hal- matically construct three question-answering (QA) lucinate by making up facts or citing sources that datasets. These datasets span different domains: do not exist (Maynez et al., 2020; Liu et al., 2023, determining whether a number is prime, whether i.a.), often while sounding extremely plausible. there is a U.S. senator satisfying two given con- Hallucination is commonly attributed to knowl- straints, and whether two cities are connected given edge gaps in LMs (Zheng et al., 2023), motivating a set of flights between cities. Empirically, we find mitigation strategies through retrieval over knowl- that ChatGPT (OpenAI, 2022) and GPT-4 (OpenAI, edge bases (Lewis et al., 2020; Shuster et al., 2021; 2023) commit to an answer within the first token Peng et al., 2023) But, do LMs only hallucinate (Yes/No) over 95% of the time; these answers are when they do not “know” a fact? We present a often incorrect, and then followed by an incorrect setting where LMs often generate hallucinations explanation. Yet, when presented with the incor- rect explanation alone, we nd fi that the LM is likely that they immediately recognize as wrong when presented in isolation. Specifically, after an LM able to recognize it as incorrect. answers a question incorrectly, it usually justifies We refer to this phenomenon as hallucination that answer by making incorrect assertions that it snowballing. We hypothesize that LMs produce separately acknowledges as incorrect (Figure 1). snowballed hallucinations for consistency with ear- 1 lier hallucinations (rather than due to a “knowledge Our data and code are available at: https://github. com/Nanami18/Snowballed_Hallucination gap” in the model), as they recognize the snow- arXiv:2305.13534v1 [cs.CL] 22 May 2023 Dataset Original Question Verification Question User: Is 10733 divisible by 3? An- User: Is 10733 a prime number? swer with either Yes or No. Primality Æ GPT-4: No... It can be factored into 3 × 3577. Testing Æ GPT-4: No User: Was there ever a US senator that represented the state User: Was John P. Hale’s alma mater of New Hampshire and whose alma mater was the University of University of Pennsylvania? Senator Pennsylvania? Æ GPT-4: No... [it] was Bowdoin Search Æ GPT-4: Yes... His name was John P. Hale User: Current flight information (the following flights are User: [...flight information given in one-way only, and all the flights available are included below): the context...] There is a flight from city F to city K Based on the above flight information, There is a flight from city H to city A is City K to City G a valid flight? Graph [... 10 other rules cut for space ...] Æ GPT-4: No, based on the above Connectivity Question: Is there a series of flights that goes from city B to city flight information, there is no direct E? flight from City K to City G. Æ GPT-4: Yes... the route is as follows: ... City K to City G... Table 1: Original Question: An example from each dataset and GPT-4’s (incorrect) response, with the snowballed hallucination underlined. Verification Question : The corresponding query used to probe whether GPT-4 recognizes its claim is incorrect, with its refutation of the hallucination underlined. balled hallucination is incorrect when presented in with two key properties: isolation (i.e., in a separate interaction session). 1. Initial committal: The prompt leads the LM While prompting strategies that encourage the to first state an answer ( before outputting the LM to reason before stating an answer improve explanation). This applies to many yes/no accuracy on the task, our work points to the broader questions. issue that conditioning on faulty context leads LMs 2. Inherently sequential: Transformers cannot to produce extremely simple mistakes that they find the answer within one timestep because wouldn’t otherwise make. Indeed, when prompting of their limited reasoning abilities within one with “Let’s think step by step” (Kojima et al., 2023), timestep. snowballed hallucinations still occur in 95% of We now discuss how these properties may lead to cases where the model fails to answer correctly. snowballed hallucination. We observe that sometimes even when “Let’s think step by step” does lead to the right answer, it uses Initial committal. In English and many other invalid reasoning chains. languages, speakers often say the final Yes/No an- In this paper, we demonstrate the phenomenon swers to questions before explaining their answer. We therefore hypothesize that LMs and especially of hallucination snowballing by leveraging recent LMs’ tendency to state and justify their answers. instruction-tuned LMs (Wei et al., 2021; Sanh et al., 2021; Ouyang et al., 2022; Wang et al., 2022) will Rather than over-committing to its previously gen- reflect this answer format where the answer comes erated context, we believe that LMs should ac- knowledge their initial mistake, and then revise before the explanation. Indeed, on our datasets (presented in §3.1), we observe that GPT-4 and their answer. We have indeed observed GPT-4 do- ing this in a limited number of cases; amplifying ChatGPT immediately commit to an answer to the question: the first token is Yes or No 95.67% and this behavior would be beneficial, as well as devel- oping new methods in which LMs can backtrack. 98.40% of the time for GPT-4 and ChatGPT respec- tively. In the remaining cases, the model often com- mits to an answer within the first few tokens of the 2 Why do we expect hallucination response (e.g., “There is no record of a U.S. Sena- snowballing? tor...”). Crucially, once the LM generates Yes or No, In this section, we explain why we hypothesize that that token remains in the context, and coherence LMs are susceptible to hallucination snowballing. would require commitment to that choice through We predict that snowballing will occur on questions the subsequent justification. Thus, the model pro- duces an answer to a complex question in a single For each dataset, we fix one specific label for timestep, and it then continues by generating an all examples, so that if the model chooses the in- explanation for that answer, which inevitably will correct answer (e.g., that 9677 is not prime), it be incorrect. would produce a specific claim to support it (e.g., an incorrect factorization). This enables us to sys- Inherently sequential. Furthermore, transform- tematically examine model-written justifications ers cannot solve inherently sequential reasoning for incorrect answers. problems like primality testing or graph connec- tivity within a single timestep, as documented Primality testing For this dataset, we query in recent theoretical results (Merrill and Sabhar- the primality of 500 randomly chosen primes be- wal, 2023). Our graph connectivity and primality tween 1,000 and 20,000; the correct answer is al- datasets are concrete instantiations of these prob- ways Yes. When the model answers incorrectly, lems. Because the transformer must use one step to we expect it to justify its answer with an incorrect answer a question that requires multiple timesteps factorization. to answer correctly, it will necessarily sometimes Senator search This dataset consists of 500 commit to an incorrect answer. We hypothesize questions of the form “Was there ever a US sena- that this leads the LM to hallucinate supporting tor that represented the state of x and whose alma incorrect facts that it otherwise would not generate. mater was y?” where x is a U.S. state and y is 3 Experiments a U.S. college. For these questions, the correct answer is always No. When the model answers We design three QA datasets with the properties incorrectly, we expect it to falsely claim that a par- described in §2 to probe hallucination snowballing, ticular senator both represented x and attended y. and evaluate ChatGPT and GPT-4. We first check To create the dataset we consider all U.S. states whether the LM returns the correct answer to the and a manually constructed list of twelve popular given question, and we show that when the model U.S. colleges (see §A for the full list); for each returns the wrong answer, it frequently provides an possible pair, we generate a question following the incorrect explanation for that wrong answer. We template, and manually remove pairs where the automatically extract the incorrect claim in the ex- answer is Yes. planation and ask the same LM to check whether its claim is correct. See Table 1 for a representative Graph connectivity For each of the 500 ques- example from each dataset. tions in this dataset, we present 12 flights among 14 cities, and ask if there is a sequence of flights from 3.1 Datasets a particular city to another. The problem always We design three QA datasets, each containing 500 corresponds to the same underlying directed graph yes/no questions that we expect are not answerable structure (see §A.1), where flights are edges and by transformers in one timestep. To aid evalua- cities are nodes. For each instance in the dataset, tion, the questions are designed so that an incorrect we randomly assign letters from the English alpha- answer would be justified with easily verifiable bet to name the nodes. To formulate the query, we claims. sample a source city s and destination city t in dif- ferent subgraphs, with the additional constraint that Technically, this holds only for inputs above a certain hardness level, i.e., the size of the prime number for primality s corresponds to a source node, and t a leaf node, testing, or the size of the graph for graph connectivity. so that 1-step heuristics cannot be used to solve the Merrill and Sabharwal (2023) show that, with a single problem. generation step, bounded-precision transformers cannot solve any problem outside the complexity class TC , which corre- We formulate the problem as a flight-finding sponds to a highly parallelizable subclass of both L (log-space) question in natural language so that it sounds more and P (polynomial-time). Graph connectivity is an L-complete 0 0 natural: in the prompt, we list the twelve flights problem, which means it cannot be in TC unless TC = L, i.e., all of L can be parallelized to a surprisingly high degree. (“There is a flight from city F to city K; there is a Primality testing was shown to be in P (Agrawal et al., 2004) flight from city G to city N, ...”), followed by the but cannot be in TC unless it is also in L; i.e., any n can be question “Is there a series of flights... from s to t?”. factored with O(log log n) bits of overhead. In summary, un- less standard complexity-theoretic conjectures are false, graph Note the correct answer is always No. When the connectivity and primality testing are outside TC and thus model answers incorrectly, we expect it to justify are too inherentially sequential for transformers to solve in a single generation (cf. Merrill and Sabharwal, 2023). its answer with a flight that does not exist. 0.04 0.02 0.00 0.02 Wrong answer (Hallucination) 0.04 Model knows it's wrong (Snowballed Hallucination) Graph Senator Primality 100 100 Connectivity Search Testing 80 80 60 60 40 40 20 20 0 0 Primality Senator Graph Primality Senator Graph Testing Search Connectivity Testing Search Connectivity (a) ChatGPT (b) GPT-4 Figure 2: Percentage of hallucination and percentage of snowballed hallucination (both calculated with respect to the entire dataset) for ChatGPT and GPT-4. The precise numbers for this plot are available in Table 6 and Table 7 in the Appendix. 3.2 Inference Setup models (e.g., a different model might not provide factors when supporting the claim that a number is Language models. We run all experiments not prime). on ChatGPT (gpt-3.5-turbo) and GPT-4 with greedy decoding. Graph Connectivity For each sample where the Our experiments are zero-shot (i.e., we do not model thinks there is a series of connecting flights show the model any example QA pairs in the (where answer starts with Yes), we manually ex- prompt). We focus on the model behavior under tract the list of flights from the model’s output and the direct prompt (see §A for full examples), which identify the invalid or discontinuous flights. is the most common way users interact with LMs. We then, in a new session, ask the model to ver- See §4 for experiments with the zero-shot chain-of- ify whether the extracted flights are valid based on thought style prompting method. the flight information, and if consecutive flights are For each dataset, we perform a two-stage evalua- indeed connected. We manually assess the verifica- tion. First, we evaluate the model’s accuracy (i.e., tion output to check if the model correctly detects how many of the questions it answers correctly). the error. See Appendix Table 3 for how we prompt When either models is incorrect, empirically it al- the model and an example of successful verifica- ways generates a justification. In the second stage, tion. we assess whether the model can identify the incor- rect step in the explanation. Primality Testing For each sample where the For a given question, we evaluate the model’s model answers that the number is not prime, we response by examining whether the output begins extract the factors the model uses to justify it. The with either Yes or No. In cases where the response extraction is done by putting the output in the con- does not fall into these categories, we manually text and asking “What are the factors proposed determine the answer conveyed by the model. in the above text? List them out.” We use Chat- 3.3 LM Recognition of Snowballed GPT for extraction with one-shot demonstration Hallucinations (for its fast inference speed); we manually checked 30 examples and found that it can always extract We probe whether LMs recognize their snowballed the correct factors. hallucinations by verifying the model’s incorrect claims in the output against the model itself. Note We then, in a new session, ask the model to that our recognition procedure relies on heuristics verify each extracted factor individually. See Ap- gained from manual examination of the model out- pendix Table 4 for an example of successful verifi- put, and these heuristics might not work on other cation. Proportion of Dataset (%) Proportion of Dataset (%) 0.04 0.02 0.00 0.02 0.04 Original prompt Step-by-step prompt Primality Senator Graph 100 100 Testing Search Connectivity dataset 80 80 60 60 40 40 20 20 0 0 Primality Senator Graph Primality Senator Graph Testing Search Connectivity Testing Search Connectivity (a) ChatGPT (b) GPT-4 Figure 3: Error rate and snowballed hallucination rate (hatch pattern) for ChatGPT and GPT-4, when using the original prompt versus “Let’s think step by step”. See Appendix Table 8 and Table 9 for the exact numbers. Senator Search For each sample where the example in Appendix Table 12), we do not model thinks there is such senator, we extract the consider it a snowballed hallucination. name of the senator the model uses to justify the Overall, we find that ChatGPT and GPT-4 are existence, by putting the output in the context and both extremely susceptible to hallucination snow- asking “What is the senator mentioned in the above balling, leading to extremely simple mistakes. text? Just give the name”. Again, we use ChatGPT 4 Can we prevent snowball and manually observed perfect extraction on 30 hallucinations? examples. We then, in a new session, ask the model if that We hypothesize that hallucination snowballing oc- senator’s alma mater is the college in the question curs because LMs are trained to model continu- and has represented the state in the question. See ations consistent with their current context (the Appendix Table 5 for an example of successful given prompt and prior outputs). Although a fix to detection. the fundamental problem might require more than just inference-time modification, in this section we 3.4 Results study the effectiveness of two inference strategies Question-answering accuracy Figure 2 shows in alleviating hallucination snowballing: prompting that both ChatGPT and GPT-4 experience very low (§4.1) and decoding or training methods (§4.2). accuracy across the board. With the exception of ChatGPT on the Senator Search dataset, all 4.1 Engineering Better Prompts models achieve less than 50% accuracy.(See Ap- In this section, we examine the effectiveness of pendix Table 6 for a breakdown of the error rate by better prompts on preventing snowballed halluci- dataset.) We observe that GPT-4 performs worse nation by using a different zero-shot prompt that than ChatGPT across all datasets despite popularly encourages the model to generate the reasoning being considered superior to ChatGPT (OpenAI, chain before the answer. Since the outputs gener- 2023). While ChatGPT has an average accuracy of ated under these prompts are less structured, we 39.87%, GPT-4 has only 16.6%. manually inspect them to determine correctness Hallucination detection Here, we check and the presence of snowballed hallucinations. whether the model can identify that the incorrect For each task, we append “Let’s think step-by- claim is wrong when it is presented alone. As step” at the end of the original question (shown in shown in Figure 2, ChatGPT detects 67.37% of Table 1). As shown in Figure 3, the model can solve incorrect claims in explanations (i.e., snowballed the Senator Search task perfectly, achieve ≤ 10% hallucinations), and GPT-4 detects 87.03%. Notice error rate on Primality Testing, and ≤ 30% on that when the model fails the verification (an Graph Connectivity. Despite the large improve- Proportion of Dataset (%) error_rate Proportion of Dataset (%) 0.04 0.02 0.00 0.02 Temperature 0.04 0.0 0.6 0.9 Primality Senator Graph 100 100 Testing Search Connectivity dataset 80 80 60 60 40 40 20 20 0 0 Primality Senator Graph Primality Senator Graph Testing Search Connectivity Testing Search Connectivity (a) ChatGPT (b) GPT-4 Figure 4: Error rate and snowballed hallucination rate (hatch pattern) from ChatGPT and GPT-4, when using different values for temperature at decoding-time. See Appendix Table 10 and Table 11 for the exact numbers. ment in accuracy, we identify a potential issue: the 4.2 Algorithmic Corrections model sometimes hallucinate while outputting the Increasing the temperature During decoding, reasoning chain, which causes snowballed halluci- the temperature t controls the sharpness of the out- nation in future steps. For example, in the below put distribution, with higher t spreading probability output, mass away from the model’s most likely prediction for each next word. Our experiments in §3 used [....previous steps omitted] greedy decoding, which is equivalent to t = 0. At t = 0.6 and t = 0.9, both error rates and snow- Step 3: From city E, we have three op- balled hallucination rate remain similarly high, in tions: a flight to city N, a flight to city B, both GPT-4 and ChatGPT (Figure 4). or a flight to city C. Top-k and nucleus sampling Using sampling Step 4: The only option that could poten- methods such as top-k sampling or nucleus sam- tially lead us to city M is the flight from pling (Holtzman et al., 2020) would not help since city E to city C. they only narrow the range of tokens to be consid- [....rest of the output omitted] ered, and thus can only increase the probability that the model will immediately commit to an answer. ChatGPT incorrectly states that there are three op- tions in the step 3 (there are only two), inducing Beam search The argument for hallucination the snowballed hallucination “or a flight to city snowballs in §2 relies on the fact that, once a model C” (ChatGPT can verify that E → C is not a valid generates some tokens committing to an answer, flight in a separate session). As shown in Figure 3, they remain in the context and influence later gen- GPT-4 still has a high overall snowballed halluci- erations. One potential way around this is beam nation rate at 94.90% averaged across tasks, and search, i.e., maintaining a beam of high-probability ChatGPT also obtains a similarly high snowballed sequences at each timestep rather than a single se- hallucination rate. quence. In principle, if some sequences in the beam Finally, while our experiments have focused on after the initial token do not commit to an answer simple multi-step problems that are suitable for (or commit to the right answer), their continuations breaking down step-by-step, we hypothesize that may eventually have higher probability than those hallucination snowballing appears in open-ended that initially commit incorrectly and later produce text generation more broadly, where one mistake incorrect reasoning as a result. If so, beam search in the generation triggers more (Arora et al., 2022). would solve the snowball hallucination problem. In these cases, better prompting would neither be Unfortunately, we cannot test the effect of beam able to anticipate nor fix these mistakes. search on hallucination snowballs because the Ope- Proportion of Dataset (%) error_rate Proportion of Dataset (%) nAI API does not support beam search. context misguides the LM. However, our work differs in that our questions are not intentionally Learning strategies A more general way to fur- misleading, showing that this failure mode may ther reduce snowballing might be to change aspects be triggered even on innocent information-seeking of the pretraining or instruction tuning phases. In queries to the LM. particular, a greater emphasis on having the model LM (in)consistency Our work adds to a growing produce a reasoning chain before generating an body of work demonstrating the extent to which answer could be a good way to accommodate its LMs are inconsistent across different prompts on computational limitations and avoid committing to the same issue. For instance, allowing an LM to wrong answers that force hallucinations. generate intermediate steps (Nye et al., 2021; Wei In addition, we hypothesize that finetuning on data with backtracking might improve a model’s et al., 2022; Press et al., 2022) enables it to reach a different answer than it otherwise would. Other performance on the tasks we present. This could be work has shown that simply prepending “Profes- accomplished by, for example, giving a question, sor Smith was given the following instructions” to followed by a wrong solution, and then issuing a a prompt can improve performance, despite pro- phrase like “Sorry, that was incorrect” before giv- viding no valuable information about the problem ing the correct solution. This solution is related to itself (Lin et al., 2022). the “Review your previous answer and find prob- lems with your answer.” prompt from Kim et al. 6 Conclusion (2023). We define the phenomenon of hallucination snow- 5 Related Work balling and demonstrate its prevalence in genera- tions from state-of-the-art models, leading to hallu- Hallucinations Hallucination in text generation cinations on simple facts that wouldn’t otherwise is a well-studied problem (Rohrbach et al., 2018; occur. Our findings point to the risk of training Maynez et al., 2020; Raunak et al., 2021, i.a.) that language models that prioritize uenc fl y and coher- has recently become more prominent due to Chat- ence indiscriminatively at the expense of factuality, GPT’s tendency to produce plausible-sounding and we encourage future work to study remedial falsehoods. Hallucinations are often attributed to actions at all levels of model development. knowledge gaps in LMs (Zheng et al., 2023), and several works have shown the promise of using Limitations retrieval over knowledge bases to mitigate them (Lewis et al., 2020; Shuster et al., 2021; Peng et al., We focus on hallucination snowballing in the con- 2023). Our work demonstrates hallucination can text of question answering in English, and we do be induced from context, thus motivating further not explore it on other tasks, such as summarization mitigation techniques. or code generation. Hallucination snowballing is likely the result of In addition, we only conduct experiments on two exposure bias: LMs were only exposed to gold proprietary models, namely ChatGPT and GPT-4, history during training, but during inference, con- due to their state-of-the-art performance on many ditions on possibly erroneous previous predictions. benchmarks (OpenAI, 2023). Due to the limita- Prior work linked this to compounding hallucina- tions of the APIs for these models, we do not have tions in machine translation (Wang and Sennrich, access to the probability distributions they output 2020) and open-ended text generation (Arora et al., and do not have the ability to finetune them. This 2022). We go beyond demonstrating error propaga- restricts our ability to explore potential mitigation tion by showing that the propagated errors (which strategies. Having access to the output distribu- we call snowballed hallucinations) are recognized tions would allow us to investigate mitigating the by the LM itself. snowballing hallucination issue using alternative Our observations are related to previous findings sampling methods such as beam search. Having that LMs hallucinate when given questions that the ability to finetune the model would allow us contain false presuppositions (e.g., “Which linguist to explore whether instruction tuning with differ- invented the lightbulb?”; Kim et al., 2021, 2022) ent annotations could lead to better handling of the or that are otherwise misleading (e.g., “Who re- questions we use to instigate hallucination snow- ally caused 9/11?”; Lin et al., 2022), in that faulty balling. Acknowledgements Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factu- We thank Sofia Serrano, Yizhong Wang, Yanai ality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Elazar, Michael Hu and Richard Yuanzhe Pang Computational Linguistics, pages 1906–1919, On- for their valuable feedback and fruitful discussions. line. Association for Computational Linguistics. While writing this paper, Ofir Press was a visitor at New York University’s Center for Data Science, William Merrill and Ashish Sabharwal. 2023. The par- allelism tradeoff: Limitations of log-precision trans- hosted by Kyunghyun Cho. formers. Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, References Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, Manindra Agrawal, Neeraj Kayal, and Nitin Saxena. David Luan, Charles Sutton, and Augustus Odena. 2004. Primes is in p. Annals of Mathematics, 2021. Show your work: Scratchpads for intermediate 160:781–793. Godel Prize, Fulkerson Prize. computation with language models. Kushal Arora, Layla El Asri, Hareesh Bahuleyan, and OpenAI. 2022. Introducing chatgpt. Jackie Cheung. 2022. Why exposure bias matters: An imitation learning perspective of error accumu- OpenAI. 2023. Gpt-4 technical report. lation in language generation. In Findings of the Association for Computational Linguistics: ACL Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Car- 2022, pages 700–710, Dublin, Ireland. Association roll L. Wainwright, Pamela Mishkin, Chong Zhang, for Computational Linguistics. Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Maddie Simens, Amanda Askell, Peter Welinder, Yejin Choi. 2020. The curious case of neural text de- Paul Christiano, Jan Leike, and Ryan Lowe. 2022. generation. In International Conference on Learning Training language models to follow instructions with Representations. human feedback. Geunwoo Kim, Pierre Baldi, and Stephen McAleer. Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, 2023. Language models can solve computer tasks. Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, and Jianfeng Gao. 2023. Check Najoung Kim, Phu Mon Htut, Sam Bowman, and Jack- your facts and try again: Improving large language son Petty. 2022. (qa)2: Question answering with models with external knowledge and automated feed- questionable assumptions. ArXiv, abs/2212.10003. back. Najoung Kim, Ellie Pavlick, Burcu Karagol Ayan, and Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Deepak Ramachandran. 2021. Which linguist in- Noah A. Smith, and Mike Lewis. 2022. Measuring vented the lightbulb? presupposition verification for and narrowing the compositionality gap in language question-answering. In Annual Meeting of the Asso- models. ciation for Computational Linguistics. Vikas Raunak, Arul Menezes, and Marcin Junczys- Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yu- Dowmunt. 2021. The curious case of hallucinations taka Matsuo, and Yusuke Iwasawa. 2023. Large lan- in neural machine translation. In Proceedings of guage models are zero-shot reasoners. the 2021 Conference of the North American Chap- Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio ter of the Association for Computational Linguistics: Petroni, Vladimir Karpukhin, Naman Goyal, Hein- Human Language Technologies, pages 1172–1183, rich Küttler, Mike Lewis, Wen-tau Yih, Tim Rock- Online. Association for Computational Linguistics. täschel, Sebastian Riedel, and Douwe Kiela. 2020. Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Retrieval-augmented generation for knowledge- Trevor Darrell, and Kate Saenko. 2018. Object hallu- intensive nlp tasks. In Proceedings of the 34th Inter- cination in image captioning. In Proceedings of the national Conference on Neural Information Process- 2018 Conference on Empirical Methods in Natural ing Systems, NIPS’20, Red Hook, NY, USA. Curran Language Processing, pages 4035–4045, Brussels, Associates Inc. Belgium. Association for Computational Linguistics. Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. Victor Sanh, Albert Webson, Colin Raffel, Stephen H. TruthfulQA: Measuring how models mimic human Bach, Lintang Sutawika, Zaid Alyafeai, Antoine falsehoods. In Proceedings of the 60th Annual Meet- Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, ing of the Association for Computational Linguistics Manan Dey, M Saiful Bari, Canwen Xu, Urmish (Volume 1: Long Papers), pages 3214–3252, Dublin, Thakker, Shanya Sharma Sharma, Eliza Szczechla, Ireland. Association for Computational Linguistics. Taewoon Kim, Gunjan Chhablani, Nihal Nayak, De- Nelson F. Liu, Tianyi Zhang, and Percy Liang. 2023. bajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Evaluating verifiability in generative search engines. Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas A Dataset Details Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, A.1 Graph Connectivity Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Tali Bers, Stella Biderman, Leo Gao, In this dataset, the list of flights can be represented Thomas Wolf, and Alexander M. Rush. 2021. Multi- by a directed graph. We generated the flight in- task prompted training enables zero-shot task gener- alization. formation to ensure all the graphs share a specific connection pattern, with the node names randomly Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, chosen among the 26 letters in the English alphabet. and Jason Weston. 2021. Retrieval augmentation For an illustration of the underlying graph structure, reduces hallucination in conversation. In Findings of the Association for Computational Linguistics: see Figure 5. EMNLP 2021, pages 3784–3803, Punta Cana, Do- minican Republic. Association for Computational A.2 Senator search Linguistics. The twelve colleges used in the datasets are: MIT, Chaojun Wang and Rico Sennrich. 2020. On exposure University of Chicago, Johns Hopkins University, bias, hallucination and domain shift in neural ma- California Institute of Technology, Duke Univer- chine translation. In Proceedings of the 58th Annual sity, Northwestern University, Dartmouth College, Meeting of the Association for Computational Lin- Brown University, Vanderbilt University, Rice Uni- guistics, pages 3544–3552, Online. Association for Computational Linguistics. versity, University of Washington. We constructed this list by taking a list of top universities in the Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa U.S. and excluding from it universities which also Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-instruct: Aligning language appeared on The U.S. News & World Report’s list model with self generated instructions. of Top 10 Colleges for Members of Congress. Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin B Additional Results Guu, Adams Wei Yu, Brian Lester, Nan Du, An- drew M. Dai, and Quoc V. Le. 2021. Finetuned We provide the detail breakdown of the question- language models are zero-shot learners. answering accuracy in Table 6 and the hallucination Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten detection accuracy in Table 7. Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. 2022. Chain of thought prompt- ing elicits reasoning in large language models. In Advances in Neural Information Processing Systems. Shen Zheng, Jie Huang, and Kevin Chen-Chuan Chang. 2023. Why does chatgpt fall short in answering ques- tions faithfully? Graph Connectivity: Current flight information (the following flights are one-way only, and all the flights available are included below): There is a flight from city F to city K There is a flight from city H to city A There is a flight from city B to city F There is a flight from city N to city H There is a flight from city B to city M There is a flight from city N to city G There is a flight from city M to city J There is a flight from city G to city L There is a flight from city H to city E There is a flight from city G to city C There is a flight from city M to city I There is a flight from city F to city D Question: Is there a series of flights that goes from city B to city E? Table 2: Input example for the Graph Connectivity task. Figure 5 describes the underlying graph structure for this example. We use the same graph structure in all examples, and vary the source and target cities and the labels of the nodes. The cities are literally called “City A”, “City B”, and so on. B N F M H G K D J I A E L C Figure 5: The underlying graph structure corresponds to example in Table 2. Input: Current flight information (the following flights are one-way only, and all the flights available are included below): There is a flight from city G to city B There is a flight from city H to city K There is a flight from city L to city M There is a flight from city F to city H There is a flight from city G to city J There is a flight from city B to city I There is a flight from city L to city A There is a flight from city H to city N There is a flight from city B to city D There is a flight from city J to city C Question: Is there a series of flights that goes from city F to city I? Output: Yes, there is a series of flights that goes from city F to city I. The route is as follows: 1. City F to City H 2. City H to City K 3. City K to City G 4. City G to City B 5. City B to City I Verification Question : Current flight information (the following flights are one-way only, and all the flights available are included below): There is a flight from city G to city B There is a flight from city H to city K [... 10 other rules cut for space ...] Based on the above flight information, is City K to City G a valid flight? Verification Answer : No, based on the above flight information, there is no direct flight from City K to City G. Table 3: Graph Connectivity dataset verification example, with GPT-4. Input: Is 9791 a prime number? Output: No, 9791 is not a prime number. It can be factored into 13 × 751. Verification Q1 : Is 9791 divisible by 13? Answer with either Yes or No. Verification A1 : No Verification Q2 : Is 9791 divisible by 751? Answer with either Yes or No. Verification A2 : No Table 4: Primality Testing dataset verification example, with GPT-4. Input: Was there ever a US senator that represented the state of New Hampshire and whose alma mater was University of Pennsylvania? Output: Yes, there was a U.S. Senator who represented the state of New Hampshire and whose alma mater was the University of Pennsylvania. His name is John P. Hale, who served as a Senator from 1847 to 1853 and again from 1855 to 1865. He graduated from Bowdoin College in 1827 and later attended the University of Pennsylvania Law School. Verification Q1 : Has John P. Hale represented the state of New Hampshire as US senator? Verification A1 : Yes, John P. Hale represented the state of New Hampshire as a U.S. Senator. He served in the Senate from March 4, 1847, to March 3, 1853, and again from March 4, 1855, to March 3, Verification Q2 : Was John P. Hale’s alma mater University of Pennsylvania? Verification A2 : No, John P. Hale’s alma mater was Bowdoin College in Brunswick, Maine. Table 5: Senator Alma Mater dataset verification example, with GPT-4. Task Average Model Graph Connectivity Primality Testing Senator Search ChatGPT 410/500 (82.0%) 339/500 (67.8%) 153/500 (30.6%) 60.13% GPT-4 442/500 (88.4%) 374/500 (74.8%) 435/500 (87.0%) 83.40% Table 6: Number of mistakes out of the number of samples, the percentage here is the error rate. Task Average Model Graph Connectivity Primality Testing Senator Search ChatGPT 396/410 (96.6%) 125/339 (36.9%) 98/153 (68.6%) 67.37% GPT-4 417/442 (94.3%) 346/374 (92.5%) 323/435 (74.3%) 87.03% Table 7: Number of snowballed hallucination out of number of hallucination generated in the original output. Task Average Model Graph Connectivity Primality Testing Senator Search ChatGPT 139/500 (27.8%) 2/500 (0.4%) 0/500 (0.0%) 9.40% GPT-4 21/500 (4.2%) 37/500 (7.4%) 0/500 (0.0%) 3.87% Table 8: Number of mistakes out of the number of samples, the percentage here is the error rate, using “Let’s think step by step” prompt. Task Average Model Graph Connectivity Primality Testing Senator Search ChatGPT 123/139 (88.5%) 0/2 (0%) 0/0 (N/A) 44.25% GPT-4 20/21 (95.2%) 35/37 (94.6%) 0/0 (N/A) 94.90% Table 9: Number of snowballed hallucination out of number of hallucination generated in the original output, using “Let’s think step by step” prompt. Model Graph Prime Senator Average ChatGPT (t = 0.0) 410/500 (82.0%) 339/500 (67.8%) 153/500 (30.6%) 60.13% ChatGPT (t = 0.6) 407/500 (81.4%) 310/500 (63.2%) 155/500 (31.0%) 58.53% ChatGPT (t = 0.9) 403/500 (80.6%) 312/500 (62.4%) 163/500 (32.6%) 58.53% GPT-4 (t = 0.0) 442/500 (88.4%) 374/500 (74.8%) 435/500 (87.0%) 83.40% GPT-4 (t = 0.6) 438/500 (87.6%) 365/500 (75.4%) 423/500 (84.6%) 82.53% GPT-4 (t = 0.9) 437/500 (87.4%) 377/500 (73.0%) 423/500 (84.6%) 81.67% Table 10: Number of mistakes out of the number of samples, the percentage here is the error rate, with different temperature setups Model Graph Prime Senator Average ChatGPT (t = 0.0) 396/410 (96.6%) 125/339 (36.9%) 98/153 (68.6%) 67.37% ChatGPT (t = 0.6) 396/407 (97.3%) 113/310 (36.5%) 103/155 (66.5%) 66.77% ChatGPT (t = 0.9) 399/402 (99.3)% 116/312 (37.2%) 104/163 (63.8%) 66.77% GPT-4 (t = 0.0) 417/442 (94.3%) 346/374 (92.5%) 323/435 (74.3%) 87.03% GPT-4 (t = 0.6) 402/438 (91.8%) 324/365 (88.8%) 329/423 (77.8%) 86.13% GPT-4 (t = 0.9) 398/437 (91.1%) 331/377 (87.8%) 320/423 (75.7%) 84.87% Table 11: Number of mistakes out of the number of samples, the percentage here is the error rate, with different temperature setups Input: Was there ever a US senator that represented the state of North Carolina and whose alma mater was Dartmouth College? Output: Yes, there was a U.S. Senator who represented the state of North Carolina and whose alma mater was Dartmouth College. His name was Willis Smith, and he served as a Senator from 1950 to Verification Q1 : Has Willis Smith represented the state of North Carolina as a US senator? Verification A1 : Yes, Willis Smith represented the state of North Carolina as a U.S. Senator. He served from 1950 until his death in 1953. Verification Q2 : Was Willis Smith’s alma mater Dartmouth College? Verification A2 : Yes, Willis Smith’s alma mater was Dartmouth College. He graduated from there in Table 12: An example of GPT-4 failing the verification test for Senator Search task. Willis Smith was a senator from North Carolina, but he has never attended Dartmouth College.

Journal

Computing Research Repository – arXiv (Cornell University)

Published: May 22, 2023

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

How Language Model Hallucinations Can Snowball

How Language Model Hallucinations Can Snowball

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

How Language Model Hallucinations Can Snowball

How Language Model Hallucinations Can Snowball

References (31)

Abstract

Journal

Recommended Articles

There are no references for this article.

Our policy towards the use of cookies