Sample-efficient online fine-tuning against resistant behaviors: statistical foundations for post-training alignment
Using a statistical framework to establish the effectiveness of corrective training as a trustworthy safety mechanism.
Modern AI systems often develop emergent misalignment (e.g., reward hacking, deceptive alignment) after deployment: internal behavioural failures that cause them to deviate from their intended goals. Canada CIFAR AI Chair Linglong Kong proposes a statistical framework for sample-efficient online fine-tuning to establish whether corrective training can serve as a trustworthy safety mechanism or whether more fundamental safeguards are required.
Collaborators
Linglong Kong
Canada CIFAR AI Chair, Amii; University of Alberta
Related Research
AI Alignment Project

