The ability of artificial intelligence (AI) to sift through mountains of information and deliver useful results is rapidly reshaping the way people learn, work, and handle numerous tasks. Yet, for all the convenience and value Generative AI and large language models (LLMs) deliver, they have a problem. Despite delivering text, video, and images that appear accurate and convincing, they sometimes hallucinate.
These fabrications—which can range from minor, plausible errors to utterly absurd assertions—are a legitimate cause for concern. At the very least, the resulting misinformation or botched image can rank as mildly amusing or annoying. In a worst-case scenario, however, LLMs can dispense bad medical or legal advice or lead to biased, discriminatory, or dangerous decision-making.
As a result, researchers and data scientists are actively exploring ways to rein in AI hallucinations through improved training and model refinements. One method, Retrieval Augmented Generation (RAG), cross-checks data using outside databases and other sources. Another approach relies on novel statistical methods to spot possible semantic glitches. Still another plugs physical data and other information into a model.
At the same time, data scientists are introducing robust filters, parameter controls, and human feedback loops that can serve as guardrails. “The challenge is to produce useful and accurate results without undermining the ability of these models to support creative ideation and brainstorming,” said Chris Callison-Burch, a professor in the School of Engineering and Applied Science at the University of Pennsylvania.
Just the Facts
No one disputes the cause of the hallucination problem: the size of today’s Generative AI and LLM models, along with the way they work, introduces opportunities for errors, sometimes on a grand scale. A model like OpenAI’s GPT-4 reportedly tops out at more than 1.76 trillion parameters. Data scientists build LLMs using a method called autoregressive generation. “It uses a probability distribution to predict the next word,” Callison-Burch said.
In other words, autoregressive generation doesn’t set out to produce accurate information per se; it’s designed to spit out the most probable outcome based on past data and surrounding words. Larger models trained on high-quality, more-diverse datasets generally increase the odds of accurate results because they can consider more words in context.
However, they can’t expunge hallucinations.
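The next-word mechanism Callison-Burch describes can be illustrated with a toy model. Everything below is invented for illustration: a real LLM conditions on thousands of preceding tokens and a vocabulary of tens of thousands of words, but the principle is the same, at each step the model emits a probable continuation, whether or not the result is true.

```python
# Toy autoregressive generation: the "model" is just a table of
# next-word probabilities conditioned on the previous word.
# All words and probabilities here are invented for illustration.
NEXT_WORD = {
    "the": {"cat": 0.5, "moon": 0.3, "answer": 0.2},
    "cat": {"sat": 0.7, "flew": 0.3},
    "sat": {"down": 0.9, "up": 0.1},
}

def generate(start, max_words=4):
    """Greedily pick the most probable next word at each step."""
    words = [start]
    while len(words) < max_words and words[-1] in NEXT_WORD:
        dist = NEXT_WORD[words[-1]]
        words.append(max(dist, key=dist.get))  # argmax of P(next | prev)
    return " ".join(words)

print(generate("the"))  # → "the cat sat down"
```

Nothing in this loop checks facts; it only chases probability mass, which is why fluency and truth can come apart.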
Plenty of mind-bending examples of AI hallucinations have appeared over the last few years. Among them: adding glue to pizza to make toppings stick better, and eating rocks to promote better nutrition and digestive health. Gen AI models produce extra fingers and sometimes eliminate body parts or key elements in an image. While it is easy for any reasonable person to dismiss the most egregious examples, things get far more complicated when LLMs veer into health and wellness advice and lawyers rely on them to prepare court briefs and cite case law.
Stories of these AI systems going off the rails have become the new normal. Yet, for all the chaos, the cause is relatively simple. “Because a language model contains massive amounts of Internet data, it may faithfully represent a concept or the overall knowledge it acquired during training—even if the information isn’t accurate,” said Daphne Ippolito, an assistant professor in the Language Technologies Institute at Carnegie Mellon University.
The bottom line? As words become numeric vectors and numerical values multiply by an order of magnitude inside a model, things get fuzzy and meanings get blurred. The model simply regurgitates words based on probability. “AI hallucinations are a feature rather than a bug,” said Anima Anandkumar, Bren Professor of Computing and Mathematical Sciences at the California Institute of Technology (Caltech). “Models are incentivized to produce plausible text, rather than factual text.”
Model Behavior
Addressing the AI hallucination problem is no simple feat. Yet, several methods have emerged that take direct aim at the problem. For example, RAG prompts the LLM to check the Web or a proprietary database to verify that the information it generates is correct. This process can take place in real time. For instance, an LLM could check to see who now serves as the governor of a state, or the current price of gasoline, rather than relying on outdated training data.
RAG extracts the desired data, plugs it into the original query, and generates a new, more-accurate response. In some cases, the system undergoes an additional self-evaluation process. It examines its answer to determine if it abides by parameters that make sense. If the system is uncertain about whether the information is correct, it can conduct additional searches to support or contradict the original response. This might include the use of specialized databases.
The result is a more comprehensive and accurate response, with fewer hallucinations. “Retrieval Augmented Generation doesn’t change the base LLM. It runs atop it by providing additional context in the prompt—using an API or similar tool to retrieve relevant information,” Callison-Burch explained.
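The retrieve-then-augment flow Callison-Burch describes can be sketched in a few lines. Everything below, from the toy knowledge base to the naive keyword retriever and prompt template, is invented for illustration; a production system would query a live search API or vector database and pass the augmented prompt to an actual model.

```python
# Minimal sketch of Retrieval Augmented Generation (RAG).
# The knowledge base entries are placeholders, not live data.
KNOWLEDGE_BASE = {
    "governor of california": "Gavin Newsom is the governor of California.",
    "price of gasoline": "The average gasoline price is $X.XX/gallon (placeholder).",
}

def retrieve(query):
    """Naive keyword retrieval: return passages whose key words all appear in the query."""
    q = query.lower()
    return [text for key, text in KNOWLEDGE_BASE.items()
            if all(word in q for word in key.split())]

def rag_prompt(query):
    """Build a prompt that grounds the model in retrieved context.

    Note the base LLM is unchanged: we only prepend retrieved
    passages to the prompt, as described above.
    """
    context = "\n".join(retrieve(query))
    return (f"Context:\n{context}\n\n"
            f"Question: {query}\n"
            f"Answer using only the context above.")

print(rag_prompt("Who is the governor of California?"))
```

The self-evaluation step mentioned earlier would wrap this in a loop: if the model flags its own answer as uncertain, retrieve again with a refined query before responding.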
Another promising method emerged in June 2024, when researchers at the University of Oxford in the U.K. published a paper in Nature outlining a novel way to detect AI hallucinations.k “If you ask an LLM the same question several times, you are likely to get different answers back,” explained Sebastian Farquhar, a Google DeepMind researcher who co-authored the paper as a Ph.D. student. “It can be extraordinarily difficult to determine whether a model is uncertain about what to say versus being uncertain about how to say it.”
The solution—referred to as semantic entropy—is rooted in the inherent vagueness of language and the numerous ways a model can interpret words, phrases, concepts, symbols, and other data. For example, the words “fly” and “bank” have multiple meanings, and a word like “thing” refers to almost any object. The Oxford researchers developed a method that uses probability and statistical analysis to score word patterns based on potential meaning versus correct phrasing.
The technique identifies likely errors and hallucinations through subtle inconsistencies in language that occur across different versions of a response. For example, the method found 45 LLM errors across 150 factual claims when it examined text generated from a group of Wikipedia biographies. The entropy approach works across six LLMs, including GPT-4 and LLaMA 2, and it requires no prior knowledge, task-specific training, or special instructions to work across different types of datasets.
However, this statistical technique addresses only a particular type of hallucination, Farquhar emphasized. It is not effective in detecting errors when an LLM consistently spits out incorrect results. “There are a lot of ways a model can go wrong and spew garbage. This is just a way to address the problem of hallucination,” Farquhar said.
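The core idea behind semantic entropy can be sketched as follows: sample several answers to the same question, cluster them by meaning rather than exact wording, then measure the entropy of the cluster sizes. This is a toy sketch; the word-set equivalence check below is a crude stand-in for the bidirectional entailment model the Oxford researchers actually use to decide whether two answers mean the same thing.

```python
import math

def semantically_equivalent(a, b):
    # Stand-in for bidirectional entailment: treat two answers as the
    # same meaning if they contain the same set of (lowercased) words.
    return set(a.lower().split()) == set(b.lower().split())

def semantic_entropy(answers):
    """Cluster sampled answers by meaning, then compute entropy over clusters."""
    clusters = []
    for ans in answers:
        for cluster in clusters:
            if semantically_equivalent(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    n = len(answers)
    # High entropy = the model keeps changing its meaning, a hallucination signal.
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)

# Same meaning, different wording → low entropy (model is confident).
print(semantic_entropy(["Paris is the capital", "the capital is Paris"]))
# Three conflicting meanings → high entropy (likely confabulation).
print(semantic_entropy(["Paris is the capital",
                        "Lyon is the capital",
                        "Nice is the capital"]))
```

Note how rephrasings land in one cluster and contribute no entropy: the method penalizes uncertainty about *what* to say, not uncertainty about *how* to say it, which is exactly the distinction Farquhar draws.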
Active Imagination
To be sure, tamping down AI hallucinations will likely require multiple techniques—often used simultaneously or to address specific needs. At Caltech, Anandkumar is exploring ways to build LLMs that understand real-world properties. “One of the biggest problems with these models,” she said, “is that they lack physical grounding.” For example, “If you ask LLMs to play tennis, they provide a lot of theories about it, but they cannot actually play tennis.”
Keeping such limitations in mind, and avoiding the urge to anthropomorphize models, is vital. Consequently, Anandkumar and other researchers at Caltech are now constructing AI models that incorporate detailed physical data. These so-called neural operators span biology, chemistry, and physics. They focus on everything from soil and plants to clouds and objects in outer space.
“We teach the models the laws of physics from the ground up,” Anandkumar said. “So, instead of training the model with Internet data of varying factuality, it is equipped to generate answers and simulate information that are physically valid and factually correct.” It’s possible to apply the technique to both classical and generative AI models.
Anandkumar also has developed an open-source toolkit, LeanDojo, that plugs mathematical reasoning into LLM training. This allows an LLM to avoid certain types of hallucinations by verifying every proof step the model proposes. “The training process typically takes place over the course of a GPU-week and leads to 100% accuracy,” Anandkumar said.
Meanwhile, publicly available LLMs such as ChatGPT, Gemini, and Copilot are turning to still other methods to reduce hallucinations. They are introducing alignment training techniques that insert fact-checking and feedback from users into the training loop. Some also tap a method called “Causal AI” that forces a system to examine a query or issue from multiple perspectives simultaneously; this increases the odds the LLM will catch a potential hallucination before releasing it.
The Dream of Better AI
The inherent complexity of large probabilistic AI models makes it nearly impossible to eradicate every hallucination or factual error. In addition, creators may want to prioritize certain functions, capabilities, and outcomes, which can lead to edge cases where hallucinations pop up. “The challenge is to build models that do things that different people desire. One person may want an answer that is wonky and creative, and another might want a reply that is factual,” Ippolito said.
Further complicating matters, hallucinations sometimes have value. They can lead to new and revolutionary ways to think—and see the world. Ippolito, who explored the topic in a Ph.D. dissertation while attending the University of Pennsylvania, argued that it is important to avoid snuffing out the creative sparks that Generative AI and LLMs sometimes deliver. The same qualities that lead to hallucinations can help people produce new art, music, and product designs. Already, companies like Nvidia and Intel are using Generative AI and LLMs to produce new and more efficient chip designs.
For now, the march toward more lucid and accurate AI models continues. Although no single approach will solve the core problem—a base model that hallucinates—the sum of these techniques offers a way to get hallucinations under control. “There’s a clear need to improve LLMs,” Callison-Burch said. “But ultimately, humans must understand there’s a need to stay in control of the technology and learn to use it responsibly.”
Further Reading
- Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., and Grave, E. Atlas: Few-shot Learning with Retrieval Augmented Language Models. November 16, 2022. https://arxiv.org/pdf/2208.03299
- Farquhar, S., Kossen, J., Kuhn, L., and Gal, Y. Detecting hallucinations in large language models using semantic entropy. Nature 630, 625–630, June 19, 2024. https://www.nature.com/articles/s41586-024-07421-0
- Ji, Z., Yu, T., Xu, Y., Lee, N., Ishii, E., and Fung, P. Towards Mitigating Hallucination in Large Language Models via Self-Reflection. Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1827–1843, December 6–10, 2023. https://aclanthology.org/2023.findings-emnlp.123.pdf
- Yang, K., Swope, A.M., Gu, A., Chalamala, R., Song, P., Yu, S., Godil, S., Prenger, R., and Anandkumar, A. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models. NeurIPS 2023, June 27, 2023. https://arxiv.org/abs/2306.15626
- Wang, C., Zhou, H., Chang, K., Li, B., Mu, Y., Xiao, T., Liu, T., and Zhu, J. Hybrid Alignment Training for Large Language Models. June 21, 2024. https://arxiv.org/html/2406.15178v1
- Ippolito, D. Understanding the Limitations of Using Large Language Models for Text Generation. Ph.D. dissertation, University of Pennsylvania, 2021. https://www.cis.upenn.edu/~ccb/publications/dissertations/daphne-ippolito-thesis.pdf