Time to Revisit Exact Match

Large language models sometimes struggle with temporal understanding, yet traditional “exact match” metrics hide these errors or mis-rank systems. This paper introduces numeric measures that capture how wrong a model's answer is, improving our understanding of model limitations and helping to prevent misplaced trust in real-world use.
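To make the mis-ranking concern concrete, here is a minimal sketch comparing exact match with a simple numeric error on year-valued answers. It is not the paper's method: this summary does not specify the paper's measures, so mean absolute error stands in as an assumed example, and the function names and data are hypothetical.

```python
def exact_match(predictions, references):
    """Fraction of predictions that equal the reference exactly."""
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


def mean_absolute_error(predictions, references):
    """Average absolute distance (here, in years) from the reference."""
    total = sum(abs(p - r) for p, r in zip(predictions, references))
    return total / len(references)


# Hypothetical gold answers and two hypothetical systems.
references = [1969, 1989, 2001]
model_a = [1970, 1990, 2002]  # always off by one year
model_b = [1969, 1889, 1901]  # one exact hit, two answers off by a century

em_a = exact_match(model_a, references)
em_b = exact_match(model_b, references)
mae_a = mean_absolute_error(model_a, references)
mae_b = mean_absolute_error(model_b, references)

print(f"Model A: exact match = {em_a:.2f}, mean absolute error = {mae_a:.2f} years")
print(f"Model B: exact match = {em_b:.2f}, mean absolute error = {mae_b:.2f} years")
```

In this toy setup, exact match ranks model B above model A even though model B's misses are a century off, while the numeric error reverses the ranking.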

See the full paper here on ACL.
