Why Language Models Hallucinate
- Source: https://arxiv.org/pdf/2509.04664
- Authors: Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, Edwin Zhang
- Date: 2025-09-04
- Clipped: 2026-03-09 (SGT)
Gist
This paper argues that hallucinations are not mysterious failures but statistically predictable errors produced by the modern LLM training and evaluation pipeline. The key claim is that both pretraining objectives and benchmark scoring conventions incentivize guessing under uncertainty rather than abstaining.
Key ideas
- Hallucinations are framed as a special case of generative errors and analyzed through learning theory.
- The paper introduces an Is-It-Valid (IIV) reduction: the generative error rate is lower-bounded by roughly twice the misclassification rate of a related binary-classification problem (is a given output valid or not?).
- Main theoretical takeaway: if validity is hard to classify, hallucinations are expected even with clean training data.
- The analysis extends to prompted settings and allows uncertainty responses such as “I don’t know” (IDK).
- For “arbitrary facts” (e.g., sparse personal facts), they connect hallucination risk to data sparsity (singleton/missing-mass intuition): rare facts are intrinsically high-risk.
- Post-training persistence is explained as an evaluation problem: many benchmarks use binary scoring, which gives an abstention the same zero as a wrong answer, so guessing is never worse than abstaining.
- They call this an “epidemic” of uncertainty-penalizing evaluations and argue that adding a few hallucination benchmarks is not enough.
- Proposed mitigation: modify mainstream evaluations to include explicit confidence targets / abstention-aware scoring (penalties for wrong answers, not just 0/1 accuracy).
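The singleton/missing-mass intuition from the list above can be illustrated with a small sketch. It computes the fraction of training observations whose fact appears exactly once, which is the classic Good-Turing estimate of how much probability mass lies on unseen items. The function name `singleton_rate` and the toy data are hypothetical, and the paper's formal bound includes additional terms that are not reproduced here.

```python
from collections import Counter


def singleton_rate(observations):
    """Fraction of observations whose value occurs exactly once.

    This is the Good-Turing style missing-mass estimate: (number of
    items seen exactly once) / (total number of observations).
    """
    if not observations:
        return 0.0
    counts = Counter(observations)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(observations)


# Hypothetical toy "training set" of (person, birthday) facts.
training_facts = [
    ("Ada", "12-10"), ("Ada", "12-10"), ("Ada", "12-10"),
    ("Bo", "03-07"),                      # seen once
    ("Cy", "08-21"), ("Cy", "08-21"),
    ("Di", "01-30"),                      # seen once
]

rate = singleton_rate(training_facts)
print(f"singleton rate: {rate:.3f}")  # 2 singletons out of 7 observations
```

A high singleton rate suggests that many facts of this kind were never seen at all, so a model that always answers would be expected to err on a comparable fraction of such queries.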
Why it matters
- It shifts the discussion from “just fix model internals” to objective and benchmark design.
- It suggests that hallucination reduction requires socio-technical change (leaderboards, grading norms), not only better prompts or RAG.
- It gives a principled way to evaluate safer behavior: reward calibrated abstention when uncertainty is high.
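As a minimal sketch of abstention-aware scoring with an explicit confidence target, consider a grader that awards 1 point for a correct answer, 0 for an abstention, and a penalty of t/(1 - t) for a wrong answer, where t is the stated confidence threshold. Under that rule, answering has positive expected score only when the model's probability of being correct exceeds t. The names `score`, `expected_score_if_answering`, `should_answer`, and the `IDK` marker are hypothetical and chosen for illustration.

```python
IDK = "IDK"  # hypothetical marker for an abstention


def score(response, truth, t):
    """Score one response under confidence target t (0 <= t < 1).

    Correct: +1, abstain: 0, wrong: -t / (1 - t).
    With t = 0 this reduces to ordinary 0/1 accuracy.
    """
    if not 0 <= t < 1:
        raise ValueError("t must satisfy 0 <= t < 1")
    if response == IDK:
        return 0.0
    if response == truth:
        return 1.0
    return -t / (1 - t)


def expected_score_if_answering(p, t):
    """Expected score of answering when the answer is correct with probability p."""
    return p - (1 - p) * t / (1 - t)


def should_answer(p, t):
    """Answering beats abstaining (score 0) exactly when p > t."""
    return expected_score_if_answering(p, t) > 0


for p in (0.5, 0.9):
    decision = "answer" if should_answer(p, 0.75) else "abstain"
    print(f"p={p}, t=0.75 -> {decision}")
```

With t = 0 the penalty vanishes and any nonzero chance of being right makes guessing worthwhile, which is the binary-scoring behavior the notes above criticize.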
Caveats
- The contribution is mostly a theoretical framing plus a benchmark-policy proposal; empirical validation across model families and tasks is limited.
- Relies on stylized assumptions (plausible response sets, validity labeling, calibration terms) that may be harder to operationalize in messy real-world generation.