Research
On Reality and the Limits of Language Data: Aligning LLMs with Human Norms
Now largely of historical interest: our 2023 study found that language-trained AI struggled with real-world common-sense reasoning. Newer 2024-25 benchmarks confirm that even multimodal models still falter on spatial and physical tasks and on object affordances. Ground-truth world modelling remains a frontier, but aligning AI with human-scale embodied knowledge is still vital for safe applications.
LoGU: Long-form Generation with Uncertainty Expressions
This paper studies how to reduce hallucinations when large language models generate long answers with multiple claims. We propose Long-form Generation with Uncertainty (LoGU), where models explicitly mark uncertain parts of their responses. Using new training data, supervised fine-tuning, and direct preference optimization, we improve factual accuracy while keeping explanations detailed, readable, and clear about knowledge gaps.
Time to Revisit Exact Match
Large language models sometimes struggle with temporal understanding, yet traditional “exact match” metrics hide these errors or mis-rank systems. This paper introduces graded numeric measures that capture how far wrong a model is, improving our understanding of model limitations and preventing misplaced trust in real-world use.
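A minimal sketch of the contrast (illustrative only; the function names and tolerance are assumptions, not the paper's metric): exact match scores a one-year miss and a century-sized miss identically, while a graded numeric measure reflects how far off the prediction is.

```python
# Illustrative sketch: exact match vs. a graded numeric measure for temporal answers.

def exact_match(pred: str, gold: str) -> float:
    """Classic exact match: 1.0 only for a perfect string match."""
    return float(pred.strip() == gold.strip())

def year_error_score(pred_year: int, gold_year: int, tolerance: int = 10) -> float:
    """Graded score that decays linearly with the absolute year error."""
    return max(0.0, 1.0 - abs(pred_year - gold_year) / tolerance)

# Exact match cannot tell a near miss from a century-sized error; a numeric measure can.
print(exact_match("1969", "1968"), exact_match("1868", "1968"))    # 0.0 0.0
print(year_error_score(1969, 1968), year_error_score(1868, 1968))  # 0.9 0.0
```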
All Roads Lead to Rome: Graph-Based Confidence Estimation for Large Language Model Reasoning
How can we be confident that large language models are confident for the right reasons? Our EMNLP 2025 paper introduces training-free, graph-based confidence estimation for reasoning tasks, modeling LLM thought paths as directed graphs and using centrality and convergence signals to improve reliability, interpretability, and downstream performance.
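A minimal sketch of the general idea (a directed graph over sampled reasoning paths, with confidence derived from how strongly paths converge on an answer), not the paper's exact algorithm; the toy chains and scoring rule are illustrative assumptions.

```python
# Sketch: confidence from convergence of sampled reasoning chains in a path graph.
from collections import Counter, defaultdict

def graph_confidence(chains: list[list[str]]) -> dict[str, float]:
    """Score each final answer by the weighted in-degree of its node."""
    edge_weight = defaultdict(int)
    for chain in chains:
        for src, dst in zip(chain, chain[1:]):
            edge_weight[(src, dst)] += 1  # reinforce transitions shared by chains

    # Convergence signal: total weight flowing into each candidate answer node.
    answers = Counter(chain[-1] for chain in chains)
    in_weight = {a: sum(w for (_, dst), w in edge_weight.items() if dst == a)
                 for a in answers}
    total = sum(in_weight.values()) or 1
    return {a: w / total for a, w in in_weight.items()}

chains = [
    ["read question", "compute 2+2", "4"],
    ["read question", "recall arithmetic facts", "compute 2+2", "4"],
    ["read question", "guess", "5"],
]
print(graph_confidence(chains))  # "4" receives most of the confidence mass
```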
Trident: Benchmarking LLM Safety in Finance, Medicine, and Law
As AI models enter high-stakes domains such as finance, medicine, and law, this work sets out clear safety principles drawn from professional ethics and introduces Trident-Bench, a new benchmark to test how well large language models adhere to them. We evaluate 19 models and find that while strong generalists (e.g., GPT, Gemini) pass basic checks, domain-specialist models often fail to comply with these policies, underlining the urgent need for targeted safety evaluations.
Beyond the final layer: Intermediate representations for better multilingual calibration in large language models
This paper tackles a blind spot in confidence calibration for multilingual large language models: it shows that non-English languages are far worse calibrated than English and finds that intermediate layers, not the final layer, offer much better confidence signals. Building on this, we introduce the Language-Aware Confidence Ensemble (LACE), a training-free method that adaptively selects the best layers for each language.
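A minimal sketch of reading confidence from intermediate layers via a logit-lens pass, not the LACE method itself; the GPT-2 checkpoint and the hypothetical per-language layer table are illustrative assumptions.

```python
# Sketch: per-layer next-token confidence by projecting intermediate states through the LM head.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def layerwise_confidence(prompt: str) -> list[float]:
    """Max-softmax probability of the next token, computed at every layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    confidences = []
    for hidden in out.hidden_states:                    # embeddings + one tensor per block
        normed = model.transformer.ln_f(hidden[:, -1])  # apply the final layer norm
        probs = torch.softmax(model.lm_head(normed), dim=-1)
        confidences.append(probs.max().item())
    return confidences

# A per-language layer choice could then be fit on held-out data, e.g.
# best_layer = {"en": 9, "de": 7}  # hypothetical values
print(layerwise_confidence("The capital of France is"))
```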
PrivacyPAD: A Reinforcement Learning Framework for Dynamic Privacy-Aware Delegation
PrivacyPAD trains a routing agent to decide which parts of a user’s prompt stay private and which are shared. It strikes a careful balance between data protection and performance, allowing users to safely benefit from powerful external models.
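A minimal sketch of the routing idea only (PrivacyPAD itself learns this decision with reinforcement learning): withhold spans that look private and delegate the redacted remainder to an external model. The regex detector and placeholder token are stand-ins, not the paper's components.

```python
# Sketch: redact private-looking spans before delegating a prompt externally.
import re

PRIVATE_PATTERNS = [
    r"\b\d{3}-\d{2}-\d{4}\b",        # SSN-like numbers
    r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",  # email addresses
]

def route_prompt(prompt: str) -> tuple[str, list[str]]:
    """Return the redacted text to share externally and the withheld spans."""
    withheld, shared = [], prompt
    for pattern in PRIVATE_PATTERNS:
        for match in re.findall(pattern, shared):
            withheld.append(match)
            shared = shared.replace(match, "[PRIVATE]")
    return shared, withheld

shared, withheld = route_prompt("Email jane.doe@example.com about claim 123-45-6789.")
print(shared)    # Email [PRIVATE] about claim [PRIVATE].
print(withheld)  # handled locally, never sent to the external model
```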
Navigating the Alignment-Calibration Trade-off: A Pareto-Superior Frontier via Model Merging
When AI models are tuned to follow human instructions, they pay an alignment tax: accuracy and diversity drop, and confidence becomes miscalibrated. Merging tuned and base models can recover both alignment and calibration, creating smarter, better-calibrated AI.
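A minimal sketch of weight-space model merging by linear interpolation, one common way to trade off a base model against its instruction-tuned variant; the checkpoint names and mixing coefficient alpha are illustrative, and this is not claimed to be the paper's exact recipe.

```python
# Sketch: interpolate the float weights of a base and a tuned model of the same architecture.
from transformers import AutoModelForCausalLM

def merge_models(base_name: str, tuned_name: str, alpha: float = 0.5):
    """Return a model whose float weights are (1 - alpha) * base + alpha * tuned."""
    base = AutoModelForCausalLM.from_pretrained(base_name)
    tuned = AutoModelForCausalLM.from_pretrained(tuned_name)
    tuned_state = tuned.state_dict()

    merged = {}
    for name, base_param in base.state_dict().items():
        if base_param.is_floating_point():
            merged[name] = (1 - alpha) * base_param + alpha * tuned_state[name]
        else:
            merged[name] = base_param  # leave integer/bool buffers untouched
    base.load_state_dict(merged)
    return base

# Sweeping alpha traces a frontier between the base model's calibration and the
# tuned model's instruction-following. Example with hypothetical checkpoint names:
# merged = merge_models("org/base-model", "org/instruction-tuned-model", alpha=0.7)
```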
SIMBENCH: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
SimBench sets a new standard for evaluating AI as a mirror of human behaviours, uniting 20 diverse datasets to reveal when model simulations succeed, when they fail, and why that matters.
UNCLE: Benchmarking Uncertainty Expressions in Long-Form Generation
Large language models often sound confident, even when wrong. This study benchmarks how they express uncertainty, helping researchers design models that reason and admit doubt more like people do.