Adversarial robustness of large language model (LLM) safety

Developing an efficient automatic attack model to improve the evaluations and training of LLMs, making them safer and more robust.

Catalyst Project | April 11, 2026

Assessing the vulnerabilities of LLMs has become a key area of AI safety research. Canada CIFAR AI Chair Gauthier Gidel proposes a novel, more efficient, automated way of finding these vulnerabilities. Using optimization and methods borrowed from image-based adversarial attacks, the project aims to provide an efficient automatic attack model. Model developers could then use it to assess how vulnerable their LLMs are and to improve evaluation and training, making the models safer and more robust.

Collaborators

Gauthier Gidel
Canada CIFAR AI Chair, Mila; Université de Montréal

Related Research

Performative Empathy and Deceptive Alignment

Repetition, Resistance, and Reinforcement: Longitudinal Effects of Conversational AI on Political Attitudes

Safe autonomous chemistry labs