Are Doubt and Uncertainty the Same Thing?
Uncertainty estimation has become one of the hottest topics in LLM research. But while reading recent work on calibration and confidence scoring, I found myself asking what feels like a fundamental question: are doubt and uncertainty actually the same thing? As LLMs increasingly support scientific, legal, and medical workflows, that distinction is becoming practically important. I’m not a cognitive scientist, but what I’ve learned suggests that doubt involves something far richer than what today’s models are capable of representing.
OpenAI’s recent research on leaderboard evaluations [1] offers one view on why models often behave in ways that resemble uncertainty without genuinely exhibiting it: models are trained to be good test-takers. When uncertain, they’ve learned that guessing often yields better scores than abstaining: like a student facing an exam, a model learns that leaving a question blank guarantees zero points, while a guess might be rewarded.
Humans experience doubt very differently. Doubt is a metacognitive act, a deliberate suspension of judgment while we evaluate what we know, what we suspect, and where we might be wrong. Suppose you are reviewing a scientific paper and cannot find the sample size. You don’t just lower a confidence score on whether n = 50 or n = 35; you hold multiple competing hypotheses in mind while seeking the missing information. Did I misread the methods section? Am I confusing the full sample size with just the control group? Was this given in the supplementary materials rather than the body of the paper? This act of postponing a decision serves as a cognitive safeguard against premature conclusions.
LLMs operate in a fundamentally different way. Their token probabilities blend multiple forms of uncertainty: aleatoric (irreducible ambiguity in the prompt or task) and epistemic (gaps in the model’s knowledge). When a model produces hedging language, such as “might” or “possibly”, we often interpret these as signs of doubt. But the crucial difference is this: models lack an explicit, self-referential representation of ignorance or doubt. They have internal signals correlated with uncertainty, but no unified, introspective system that treats those signals as ‘me not knowing’ in the way humans do. Those hedges are learned linguistic patterns, not genuine epistemic warnings.
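To make this concrete, here is a minimal Python sketch of the kind of score such interpretations usually rest on: the average entropy of the model’s next-token distributions, computed from per-token log-probabilities (the values in `steps` below are made up). A single number like this mixes aleatoric and epistemic uncertainty together and says nothing about what, specifically, the model does not know.

```python
import math

def token_entropy(logprobs):
    """Shannon entropy (in nats) of a single next-token distribution,
    given the log-probabilities of the candidate tokens."""
    return -sum(math.exp(lp) * lp for lp in logprobs)

def mean_sequence_entropy(per_step_logprobs):
    """Average per-token entropy over a generated answer -- a crude
    uncertainty score that mixes aleatoric and epistemic components."""
    return sum(token_entropy(lps) for lps in per_step_logprobs) / len(per_step_logprobs)

# Made-up log-probabilities for three generation steps (top-3 candidates each).
steps = [
    [math.log(0.90), math.log(0.05), math.log(0.05)],  # a confident step
    [math.log(0.40), math.log(0.35), math.log(0.25)],  # a much less certain step
    [math.log(0.70), math.log(0.20), math.log(0.10)],
]
print(f"mean entropy: {mean_sequence_entropy(steps):.3f} nats")
```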
OpenAI’s analysis shows that modern evaluation benchmarks actively penalize uncertainty: in nine of the ten major benchmarks examined, “I don’t know” receives little or no credit. This potentially creates an incentive structure that runs through the model development pipeline, from pre-training to evaluation, in which guessing is systematically rewarded over abstention.
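The incentive is easy to see with a back-of-the-envelope calculation. Under a hypothetical 0/1 grading scheme, guessing always has non-negative expected value unless wrong answers are penalised, so abstaining can never win; the numbers below are purely illustrative, not taken from any particular benchmark.

```python
def expected_score(p_correct, wrong_penalty=0.0):
    """Expected exam-style score from guessing: +1 if right (probability
    p_correct), minus wrong_penalty if wrong. Abstaining always scores 0."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty

p = 0.3  # the model is only 30% sure of its guess
for penalty in (0.0, 1.0):
    print(f"penalty={penalty}: guess={expected_score(p, penalty):+.2f}, abstain=+0.00")
# With no penalty, guessing (+0.30) beats abstaining; with a penalty of 1,
# guessing (-0.40) loses and abstention becomes the rational choice.
```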
Recent reasoning-focused models like o1, GPT-5.1, and Claude Opus have made progress. They reliably state when information is missing and can simulate evidence-checking. Yet the fundamental mechanism remains unchanged: they still lack introspective access to their own knowledge states. Their apparent caution is learned behaviour, not cognitive awareness.
This produces a strange paradox: the better models become at expressing uncertainty linguistically, the harder it becomes to distinguish genuine gaps in knowledge from well-calibrated language generation.
Without due care, conflating doubt with uncertainty could have practical consequences across critical domains:
Medical literature reviews: a model might confidently interpolate missing dosage information, embedding fabrication within an otherwise accurate summary.
Legal research: models might generate plausible-sounding but nonexistent case precedents when the prompt implies they should exist.
Scientific peer review: models might mix genuine insights with hallucinated concerns, all expressed with identical confidence.
Deployment studies [4] reveal further challenges: uncertainty methods that look robust in controlled evaluations can break under adversarial prompts and are sensitive to typos or conversational history. What appears well-calibrated in the lab becomes fragile in the wild.
Researchers are pursuing several promising approaches. Confidence-elicitation methods aim to extract calibrated probabilities; chain-of-thought prompting can expose reasoning gaps; selective prediction allows abstention when uncertainty is high. Semantic entropy, introduced by Farquhar et al. [2], clusters outputs by meaning rather than by tokens, distinguishing “different ways to say the same thing” from “different things to say” and revealing hidden hallucinations. Yet models also exhibit CHOKE (Certain Hallucinations Overriding Known Evidence) [6]: they hallucinate with high certainty despite knowing the correct answer, and adversarial perturbations can distort uncertainty signals enough to trigger a hallucinated answer.
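As a rough illustration of the semantic-entropy idea, the sketch below clusters sampled answers that ‘mean the same thing’ and computes the entropy over those clusters. In the actual method of Farquhar et al. [2] the clustering is done with a bidirectional entailment model over multiple sampled generations; the `means_the_same` string comparison here is just a placeholder.

```python
import math

def means_the_same(a: str, b: str) -> bool:
    """Stand-in for the bidirectional entailment check used in the paper;
    here two answers 'mean the same' if their normalised strings match."""
    return a.strip().lower() == b.strip().lower()

def semantic_entropy(samples):
    """Entropy (nats) over meaning-clusters of sampled answer strings."""
    clusters = []  # each cluster holds answers judged semantically equivalent
    for s in samples:
        for cluster in clusters:
            if means_the_same(s, cluster[0]):
                cluster.append(s)
                break
        else:
            clusters.append([s])
    probs = [len(c) / len(samples) for c in clusters]
    return -sum(p * math.log(p) for p in probs)

# Differently phrased but identical answers -> entropy ~0 (one meaning).
print(semantic_entropy(["Paris"] * 7 + ["paris "] * 3))
# Genuinely different answers -> high entropy (~0.69), a hallucination warning.
print(semantic_entropy(["Paris"] * 5 + ["Lyon"] * 5))
```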
The abstention literature [3] offers a different framing: instead of estimating uncertainty, perhaps models should simply refuse to answer questions beyond their knowledge. But abstention still requires recognising ignorance.
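Mechanically, selective prediction and abstention are trivial to implement, which is exactly the point: a hypothetical helper like `answer_or_abstain` below just compares a confidence estimate against an arbitrary threshold, so the whole scheme inherits every weakness of that estimate.

```python
def answer_or_abstain(answer: str, confidence: float, threshold: float = 0.75) -> str:
    """Selective prediction: give the answer only if a confidence estimate
    clears an (arbitrary) threshold, otherwise abstain. The scheme is only
    as good as the confidence estimate it is handed."""
    return answer if confidence >= threshold else "I don't know."

print(answer_or_abstain("n = 50", confidence=0.92))  # answers
print(answer_or_abstain("n = 50", confidence=0.40))  # abstains
```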
Recent work testing o1 models in critical care scenarios [5] highlights another vulnerability: gap-closing cues. When a clinical vignette contains a detail that appears to resolve uncertainty, such as a known outcome, a tight temporal sequence, or an apparently obvious diagnosis, the model flips confidently from one conclusion to another. Degany et al. show that even advanced reasoning collapses when the framing gives the illusion of certainty. In other cases, such as framing biases or mathematical equivalences, the same reasoning capabilities help the model recognise that differently phrased information is functionally identical. The authors connect these patterns to bounded rationality: when faced with incomplete information, even sophisticated reasoning models “satisfice”, choosing an acceptable answer rather than fully exploring alternatives.
Humans don’t seem to be perfect doubters either. We often collapse to premature certainty, but we at least have a process that can, in principle, flag ‘I should slow down here.’ LLMs lack an analogous, explicit meta-level check.
We find ourselves in a peculiar intermediate stage. Models are sophisticated enough to sound epistemically responsible, but not yet capable of being epistemically trustworthy. They can perform doubt without experiencing it.
This discussion points to several interesting research questions:
How do we create benchmarks that reward appropriate doubt?
How might future architectures explicitly represent knowledge states and their absence?
How can we prevent models from being deceived by gap-closing cues that create false certainty?
References
[1] Kalai, A. T., Nachum, O., Vempala, S. S., & Zhang, E. (2025). Why Language Models Hallucinate. arXiv preprint arXiv:2509.04664. https://arxiv.org/abs/2509.04664
[2] Farquhar, S., Kossen, J., Kuhn, L., & Gal, Y. (2024). Detecting Hallucinations in Large Language Models Using Semantic Entropy. Nature, 630, 625-630. https://doi.org/10.1038/s41586-024-07421-0
[3] Wen, B., Yao, J., Feng, S., Xu, C., Tsvetkov, Y., Howe, B., & Wang, L. L. (2025). Know Your Limits: A Survey of Abstention in Large Language Models. arXiv preprint arXiv:2407.18418. https://arxiv.org/abs/2407.18418
[4] Bakman, Y. F., Yaldiz, D. N., Kang, S., Zhang, T., Buyukates, B., Avestimehr, S., & Karimireddy, S. P. (2025). Reconsidering LLM uncertainty estimation methods in the wild. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 29531-29556). https://aclanthology.org/2025.acl-long.1429/
[5] Degany, O., Laros, S., Idan, D., & Einav, S. (2025). Evaluating the o1 reasoning large language model for cognitive bias: a vignette study. Critical Care, 29, 376. https://doi.org/10.1186/s13054-025-05591-5
[6] Simhi, A., Itzhak, I., Barez, F., Stanovsky, G., & Belinkov, Y. (2025). Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs. arXiv preprint arXiv:2502.12964. https://arxiv.org/abs/2502.12964
Acknowledgements
Many thanks to Auss Abbood and Caiqi Zhang for their very thoughtful comments on an earlier version of this article.