SIMBENCH: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

SimBench sets a new standard for evaluating AI as a mirror of human behaviors, uniting 20 diverse datasets to reveal when model simulations succeed, when they fail, and why it matters.

See the full article on arXiv.
