Sample-efficient online fine-tuning against resistant behaviors: statistical foundations for post-training alignment

Using a statistical framework to establish the effectiveness of corrective training as a trustworthy safety mechanism.

AI Alignment Project | April 11, 2026

Modern AI systems often develop emergent misalignment after deployment (e.g., reward hacking, deceptive alignment), an internal behavioural failure that causes them to deviate from their intended goals. Canada CIFAR AI Chair Linglong Kong proposes a statistical framework for sample-efficient online fine-tuning. The framework aims to establish whether corrective training can serve as a trustworthy safety mechanism or whether more fundamental safeguards are required.

Collaborators

Linglong Kong
Canada CIFAR AI Chair, Amii; University of Alberta

Related Research

AI Alignment Project

A unified statistical framework for quantifying rare event risks for language models

AI Alignment Project

Game-theoretic safety guarantees for advanced AI systems

Scaling laws, data distributions, and learning dynamics: simulated high-energy physics data as a benchmark for data in the wild