A unified statistical framework for quantifying rare event risks for language models

Quantifying rare-event risks using standard statistical tools to better compare models, set safety standards and measure model regression.

AI Alignment Project | April 11, 2026

One of the central challenges of AI alignment is quantifying extremely small failure rates: probabilities so low that ordinary testing is unlikely ever to observe them. ‘Long-tail failures,’ such as jailbreaks, policy evasion or subtle safety violations, are where oversight is weakest. Without reliable estimates of how often they occur, it is impossible to compare models, set safety standards or measure model regression with confidence. Canada CIFAR AI Chair Bei Jiang will address this issue using standard statistical tools designed for rare-event problems, which are currently underused in large language model evaluation.
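
To make the scale of the problem concrete, here is a minimal sketch of one textbook rare-event calculation: the exact one-sided binomial upper bound when zero failures are observed (the basis of the "rule of three"). It is an illustration only and is not taken from the project; the function names, the 95% confidence level, and the assumption that test prompts are independent trials with a fixed failure probability are all assumptions made for this example.

```python
import math


def upper_bound_zero_failures(n_trials: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper bound on the failure rate after 0 failures in n_trials.

    Assumes independent trials with a constant failure probability p.
    Solves (1 - p)^n = alpha for p, where alpha = 1 - confidence.
    """
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_trials)


def trials_needed(target_rate: float, confidence: float = 0.95) -> int:
    """Smallest number of failure-free trials that bounds the rate at target_rate."""
    alpha = 1.0 - confidence
    # log1p(-p) computes log(1 - p) accurately for very small p.
    return math.ceil(math.log(alpha) / math.log1p(-target_rate))


if __name__ == "__main__":
    # 10,000 clean test prompts only bound the failure rate near 3 / 10,000.
    print(upper_bound_zero_failures(10_000))
    # Bounding the rate at one in a million takes roughly three million clean trials.
    print(trials_needed(1e-6))
```

Under these assumptions, a test suite that shows no failures still says little about rates far below roughly 3 divided by the number of trials, which is why naive sampling scales poorly for long-tail failures.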

Collaborators

Bei Jiang
Canada CIFAR AI Chair, Amii; University of Alberta

Related Research

AI Alignment Project

Game-theoretic safety guarantees for advanced AI systems

AI Alignment Project

Sample-efficient online fine-tuning against resistant behaviors: statistical foundations for post-training alignment

Scaling laws, data distributions, and learning dynamics: simulated high-energy physics data as a benchmark for data in the wild