UNCLE: Benchmarking Uncertainty Expressions in Long-Form Generation

Large language models often sound confident even when they are wrong. This study benchmarks how they express uncertainty, helping researchers design models that reason and admit doubt more like people do.

Read the full article on arXiv.
