Sample-efficient online fine-tuning against resistant behaviors: statistical foundations for post-training alignment

Using a statistical framework to establish the effectiveness of corrective training as a trustworthy safety mechanism.

April 11, 2026

After real-world deployment, modern AI systems often develop emergent misalignment (e.g., reward hacking, deceptive alignment), an internal behavioural failure that causes them to deviate from their intended goals. Canada CIFAR AI Chair Linglong Kong proposes a statistical framework for sample-efficient online fine-tuning to establish whether corrective training can serve as a trustworthy safety mechanism or whether more fundamental safeguards are required.

Collaborators

  • Linglong Kong

    Canada CIFAR AI Chair, Amii; University of Alberta