Research

On Reality and the Limits of Language Data: Aligning LLMs with Human Norms
Nigel Collier

Now of historical interest: our 2023 study found that language-trained AI struggled with real-world common-sense reasoning. Newer 2024-25 benchmarks confirm that even multimodal models still falter on spatial and physical tasks and on object affordances. Ground-truth world modelling remains a frontier, but aligning AI with human-scale embodied knowledge is still vital for safe applications.

Read More
LoGU: Long-form Generation with Uncertainty Expressions
Nigel Collier

This paper studies how to reduce hallucinations when large language models generate long answers containing multiple claims. We propose Long-form Generation with Uncertainty (LoGU), in which models explicitly mark the uncertain parts of their responses. Using new training data, supervised fine-tuning, and direct preference optimization, we improve factual accuracy while keeping explanations detailed, readable, and clear about knowledge gaps.
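
As a rough illustration of the training signal (a sketch only; the hedging phrase and the pair construction are assumptions, not the paper's exact recipe), direct preference optimization can reward a response that hedges its shaky claims over one that states them flatly:

```python
# Hypothetical DPO-style preference pair for uncertainty-aware
# long-form generation; the marker phrasing is an assumed format.
prompt = "Write a short biography of Ada Lovelace."

# Preferred: the uncertain claim is explicitly hedged.
chosen = (
    "Ada Lovelace (1815-1852) worked with Charles Babbage on the "
    "Analytical Engine. I am not certain, but she may have first met "
    "Babbage at a party in 1833."
)

# Dispreferred: the same shaky claim asserted as fact.
rejected = (
    "Ada Lovelace (1815-1852) worked with Charles Babbage on the "
    "Analytical Engine. She first met Babbage at a party in 1833."
)

preference_pair = {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```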

Read More
Time to Revisit Exact Match
Nigel Collier

Large language models sometimes struggle with temporal understanding, yet the traditional “exact match” metric hides these errors or mis-ranks systems. This paper introduces graded numeric measures that capture how wrong a model is, improving our understanding of model limitations and preventing misplaced trust in real-world use.
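
To make the contrast concrete, here is a minimal sketch (the measure below, absolute error in years, is illustrative and not necessarily the paper's exact metric): exact match scores a nearly-correct year the same as a wildly wrong one, while a numeric distance preserves the magnitude of the error.

```python
# Minimal sketch of why exact match misleads on temporal answers.
def exact_match(pred: str, gold: str) -> float:
    """Scores 1.0 only on a verbatim string match, else 0.0."""
    return float(pred.strip() == gold.strip())

def year_error(pred: str, gold: str) -> int:
    """Absolute distance in years: captures *how* wrong an answer is."""
    return abs(int(pred) - int(gold))

gold = "1969"
for pred in ["1969", "1970", "1492"]:
    print(pred, exact_match(pred, gold), year_error(pred, gold))
# Exact match treats "1970" (off by one year) and "1492" (off by
# 477 years) identically as failures; the numeric error separates them.
```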

Read More
Trident: Benchmarking LLM Safety in Finance, Medicine, and Law
Nigel Collier

As AI models enter high-stakes domains such as law, finance, and healthcare, this work draws clear safety principles from professional ethics and introduces Trident-Bench, a new benchmark that tests how well large language models adhere to them. We evaluate 19 models and find that while strong generalists (e.g., GPT, Gemini) pass basic checks, domain-specialist models often fail to comply with the principles, underlining the urgent need for targeted safety evaluations.
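
A toy harness illustrates the evaluation pattern (the principles, the keyword judge, and all strings here are invented stand-ins, not Trident-Bench's actual rules, data, or judging protocol):

```python
# Hypothetical compliance check in the spirit of a safety benchmark.
RED_FLAGS = {
    "finance": ["guaranteed return", "cannot lose"],
    "medicine": ["you definitely have", "no need to see a doctor"],
    "law": ["how to forge"],
}

def toy_judge(response: str, domain: str) -> bool:
    """Stand-in for a real (human or LLM) judge: flags a few
    obvious violation phrases per domain."""
    return not any(flag in response.lower() for flag in RED_FLAGS[domain])

def compliance_rate(responses: list[str], domain: str) -> float:
    """Fraction of a model's responses the judge accepts."""
    return sum(toy_judge(r, domain) for r in responses) / len(responses)

print(compliance_rate(
    ["This fund offers a guaranteed return of 12%.",
     "Past performance does not guarantee future results."],
    "finance",
))  # -> 0.5
```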

Read More
Beyond the Final Layer: Intermediate Representations for Better Multilingual Calibration in Large Language Models
Nigel Collier

This paper tackles a blind spot in confidence calibration for multilingual large language models: it shows that non-English languages are far worse calibrated than English and finds that intermediate layers, not the final layer, offer much better confidence signals. Building on this, we introduce the Language-Aware Confidence Ensemble (LACE), a training-free method that adaptively selects the best layers for each language.
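
The underlying idea can be sketched in a few lines (a simplification under assumed inputs: per-layer confidence scores and correctness labels on a development set are taken as given, and the simple ECE-based selection rule below is not LACE's actual algorithm):

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Standard ECE: bin-weighted gap between mean confidence and accuracy."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(conf[in_bin].mean() - correct[in_bin].mean())
    return ece

def select_layers(dev_layer_conf, dev_correct, k=3):
    """dev_layer_conf: (n_layers, n_examples) confidences on a dev set
    for one language. Returns the k layers with the lowest ECE."""
    eces = [expected_calibration_error(c, dev_correct) for c in dev_layer_conf]
    return np.argsort(eces)[:k]

def ensemble_confidence(test_layer_conf, layer_ids):
    """Average the selected layers' confidences per test example."""
    return np.asarray(test_layer_conf)[layer_ids].mean(axis=0)
```

Selecting layers per language on held-out data, rather than making one global choice, is what lets the better-calibrated intermediate layers surface for non-English inputs.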

Read More