Research
On Reality and the Limits of Language Data: Aligning LLMs with Human Norms
Now largely of historical interest: our 2023 study found that language-trained AI struggled with real-world common-sense reasoning. Newer 2024-25 benchmarks confirm that even multimodal models still falter on spatial and physical tasks and on object affordances. Ground-truth world modelling remains a frontier, but aligning AI with human-scale embodied knowledge is still vital for safe applications.
LoGU: Long-form Generation with Uncertainty Expressions
This paper studies how to reduce hallucinations when large language models generate long answers with multiple claims. We propose Long-form Generation with Uncertainty (LoGU), where models explicitly mark uncertain parts of their responses. Using new training data, supervised fine-tuning, and direct preference optimization, we improve factual accuracy while keeping explanations detailed, readable, and clear about knowledge gaps.
Time to Revisit Exact Match
Large language models sometimes struggle with temporal understanding, yet traditional “exact match” metrics hide these errors or mis-rank systems. This paper introduces graded numeric measures that capture how far wrong a model is, improving our understanding of model limitations and preventing misplaced trust in real-world use.
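A minimal sketch of the contrast (illustrative only; the function names and tolerance are assumptions, not the paper's metric): exact match scores a one-year miss and a century-sized miss identically, while a graded numeric measure reflects how far off the prediction is.

```python
# Illustrative sketch: exact match vs. a graded numeric measure for temporal answers.

def exact_match(pred: str, gold: str) -> float:
    """Classic exact match: 1.0 only for a perfect string match."""
    return float(pred.strip() == gold.strip())

def year_error_score(pred_year: int, gold_year: int, tolerance: int = 10) -> float:
    """Graded score that decays linearly with the absolute year error."""
    return max(0.0, 1.0 - abs(pred_year - gold_year) / tolerance)

# Exact match cannot tell a near miss from a century-sized error; a numeric measure can.
print(exact_match("1969", "1968"), exact_match("1868", "1968"))    # 0.0 0.0
print(year_error_score(1969, 1968), year_error_score(1868, 1968))  # 0.9 0.0
```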
All Roads Lead to Rome: Graph-Based Confidence Estimation for Large Language Model Reasoning
How can we be confident that large language models are confident for the right reasons? Our EMNLP 2025 paper introduces training-free, graph-based confidence estimation for reasoning tasks, modeling LLM thought paths as directed graphs and using centrality and convergence signals to improve reliability, interpretability, and downstream performance.
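A minimal sketch of the general idea (a directed graph over sampled reasoning paths, with confidence derived from how strongly paths converge on an answer), not the paper's exact algorithm; the toy chains and scoring rule are illustrative assumptions.

```python
# Sketch: confidence from convergence of sampled reasoning chains in a path graph.
from collections import Counter, defaultdict

def graph_confidence(chains: list[list[str]]) -> dict[str, float]:
    """Score each final answer by the weighted in-degree of its node."""
    edge_weight = defaultdict(int)
    for chain in chains:
        for src, dst in zip(chain, chain[1:]):
            edge_weight[(src, dst)] += 1  # reinforce transitions shared by chains

    # Convergence signal: total weight flowing into each candidate answer node.
    answers = Counter(chain[-1] for chain in chains)
    in_weight = {a: sum(w for (_, dst), w in edge_weight.items() if dst == a)
                 for a in answers}
    total = sum(in_weight.values()) or 1
    return {a: w / total for a, w in in_weight.items()}

chains = [
    ["read question", "compute 2+2", "4"],
    ["read question", "recall arithmetic facts", "compute 2+2", "4"],
    ["read question", "guess", "5"],
]
print(graph_confidence(chains))  # "4" receives most of the confidence mass
```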
Trident: Benchmarking LLM Safety in Finance, Medicine, and Law
As AI models enter high-stakes domains such as finance, medicine, and law, this work sets out clear safety principles drawn from professional ethics and introduces Trident-Bench, a new benchmark to test how well large language models adhere to them. We evaluate 19 models and find that while strong generalists (e.g., GPT, Gemini) pass basic checks, domain-specialist models often fail to comply with these policies, underlining the urgent need for targeted safety evaluations.
Beyond the final layer: Intermediate representations for better multilingual calibration in large language models
This paper tackles a blind spot in confidence calibration for multilingual large language models: it shows that non-English languages are far worse calibrated than English and finds that intermediate layers, not the final layer, offer much better confidence signals. Building on this, we introduce the Language-Aware Confidence Ensemble (LACE), a training-free method that adaptively selects the best layers for each language.
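A minimal sketch of reading confidence from intermediate layers via a logit-lens pass, not the LACE method itself; the GPT-2 checkpoint and the hypothetical per-language layer table are illustrative assumptions.

```python
# Sketch: per-layer next-token confidence by projecting intermediate states through the LM head.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def layerwise_confidence(prompt: str) -> list[float]:
    """Max-softmax probability of the next token, computed at every layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    confidences = []
    for hidden in out.hidden_states:                    # embeddings + one tensor per block
        normed = model.transformer.ln_f(hidden[:, -1])  # apply the final layer norm
        probs = torch.softmax(model.lm_head(normed), dim=-1)
        confidences.append(probs.max().item())
    return confidences

# A per-language layer choice could then be fit on held-out data, e.g.
# best_layer = {"en": 9, "de": 7}  # hypothetical values
print(layerwise_confidence("The capital of France is"))
```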
PrivacyPAD: A Reinforcement Learning Framework for Dynamic Privacy-Aware Delegation
PrivacyPAD trains a routing agent to decide which parts of a user’s prompt stay private and which are shared. It strikes a careful balance between data protection and performance, allowing users to safely benefit from powerful external models.
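A minimal sketch of the routing idea only (PrivacyPAD itself learns this decision with reinforcement learning): withhold spans that look private and delegate the redacted remainder to an external model. The regex detector and placeholder token are stand-ins, not the paper's components.

```python
# Sketch: redact private-looking spans before delegating a prompt externally.
import re

PRIVATE_PATTERNS = [
    r"\b\d{3}-\d{2}-\d{4}\b",        # SSN-like numbers
    r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",  # email addresses
]

def route_prompt(prompt: str) -> tuple[str, list[str]]:
    """Return the redacted text to share externally and the withheld spans."""
    withheld, shared = [], prompt
    for pattern in PRIVATE_PATTERNS:
        for match in re.findall(pattern, shared):
            withheld.append(match)
            shared = shared.replace(match, "[PRIVATE]")
    return shared, withheld

shared, withheld = route_prompt("Email jane.doe@example.com about claim 123-45-6789.")
print(shared)    # Email [PRIVATE] about claim [PRIVATE].
print(withheld)  # handled locally, never sent to the external model
```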
Navigating the Alignment-Calibration Trade-off: A Pareto-Superior Frontier via Model Merging
When AI models are tuned to follow human instructions, they pay an alignment tax: accuracy and diversity drop, and confidence becomes miscalibrated. Merging tuned and base models can recover both alignment and calibration, creating smarter, better-calibrated AI.
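A minimal sketch of weight-space model merging by linear interpolation, one common way to trade off a base model against its instruction-tuned variant; the checkpoint names and mixing coefficient alpha are illustrative, and this is not claimed to be the paper's exact recipe.

```python
# Sketch: interpolate the float weights of a base and a tuned model of the same architecture.
from transformers import AutoModelForCausalLM

def merge_models(base_name: str, tuned_name: str, alpha: float = 0.5):
    """Return a model whose float weights are (1 - alpha) * base + alpha * tuned."""
    base = AutoModelForCausalLM.from_pretrained(base_name)
    tuned = AutoModelForCausalLM.from_pretrained(tuned_name)
    tuned_state = tuned.state_dict()

    merged = {}
    for name, base_param in base.state_dict().items():
        if base_param.is_floating_point():
            merged[name] = (1 - alpha) * base_param + alpha * tuned_state[name]
        else:
            merged[name] = base_param  # leave integer/bool buffers untouched
    base.load_state_dict(merged)
    return base

# Sweeping alpha traces a frontier between the base model's calibration and the
# tuned model's instruction-following. Example with hypothetical checkpoint names:
# merged = merge_models("org/base-model", "org/instruction-tuned-model", alpha=0.7)
```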
SIMBENCH: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
SimBench sets a new standard for evaluating AI as a mirror of human behaviours, uniting 20 diverse datasets to reveal when model simulations succeed, when they fail, and why that matters.
UNCLE: Benchmarking Uncertainty Expressions in Long-Form Generation
Large language models often sound confident, even when wrong. This study benchmarks how they express uncertainty, helping researchers design models that reason and admit doubt more like people do.